Faculty of Information Technology
Institute of Computer Engineering
Chair of Digital Engineering
Parallel Architectures
IAY0060
Lecturer: K. Tammemäe
Student: Valentin Tihhomirov
971081 LASM
Tallinn 2005
Contents
2 Introduction 4
2.1 Brain research . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Artificial NNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Demand for the Neural HW . . . . . . . . . . . . . . . . . . . 10
3 Traditional Approach 12
3.1 Simulating Artificial Neural Networks on Parallel Architec-
tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Mapping neural networks on parallel machines . . . . . . . 12
3.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Simulation on General-Purpose Parallel Machines . . . . . . 15
3.5 Neurocomputers . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6.1 ANNs on RAPTOR2000 . . . . . . . . . . . . . . . . . 25
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Spiking NNs 29
4.1 Theoretical background . . . . . . . . . . . . . . . . . . . . . 29
4.2 Sample HW . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Learning at the Edge of Chaos . . . . . . . . . . . . . 30
4.2.2 MASPINN on NeuroPipe-Chip: A Digital Neuro-Processor 31
4.2.3 Analog VLSI for SNN . . . . . . . . . . . . . . . . . . 32
4.3 Maass-Markram theory: WetWare in Liquid Computer . . . . 33
4.3.1 The ‘Hard Liquid’ . . . . . . . . . . . . . . . . . . . . 36
4.4 The Blue Brain . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Blue Gene . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.2 Brain simulation on BG . . . . . . . . . . . . . . . . . 39
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Conclusions 41
6 Epilogue 45
Chapter 1
What really motivated us to study the field was the announcement [1] of the Blue
Brain project1, the assent of IBM Corp. to grant their TOP15-listed BG/L
computer to the Brain Mind Institute at Switzerland's École Polytechnique
Fédérale de Lausanne (EPFL) for replicating 'in silico' one of the brain's
building blocks, a neocortical column (NCC).
Henry Markram, the project leader, its initiator and founder of the Brain
Mind Institute, explains:
The neocortical column is the beginning of intelligence and adapt-
ability marking the jump from reptiles to mammals. When it
evolved, it was like Mother Nature had discovered the Pentium
chip. The circuitry was so successful that it’s just duplicated,
with very little variation, from mouse to man. In the human
cortex, there are just more cortical columns — about 1 million.
Over the past 10 years, Markram's laboratory has developed new techniques
for multi-neuron patch-clamp recordings, producing highly quantitative
data on the electrophysiology and anatomy of the different types of
neurons and the connections they form. The data obtained give an almost
complete digital description of the microstructure, operation and learning
function, making it possible to begin the reconstruction of the NCC in
SW.
The BB goal is merely to build a simulacrum of a biological brain. It
is achieved when the outputs produced by the simulation in response to
particular inputs are identical to those of the 'wet' experiments. If that
works, two directions are planned. Once the cellular model of the NCC is
debugged and optimized, the BG/L will be replaced by a HW chip, which is
easy to replicate in millions for simulation of the whole brain. The second
track will be to
1 http://bluebrainproject.epfl.ch/
work at a more elementary level – to simulate the brain at the molecular
level and to look at the role of genes in brain function.
Replacing the 'in vivo' experiments by 'in silico' simulation would turn
years of brain research into days, saving huge funds and lab animals. What
is much more interesting is the hope that the project will shed some light
on the emergence of consciousness. Scientists have no purchase on this
elusive phenomenon at all, but you have to start somewhere, quips Markram.
It is not the first attempt to build a computer model of the brain. This
time, however, it is launched by world leaders of neuro- and computer
science, so we take it seriously as the most ambitious project ever
conducted in neuroscience.
Figure 1.1: What kind of HW can efficiently realize the pure connectivity of
these parallel 'computing threads', which are 3D and morphing at that?
Any ideas on decomposition? [A video frame from the BB site]
Chapter 2
Introduction
In this year, 2006, the world celebrates a century since the Spanish
histologist Santiago Ramón y Cajal was awarded the Nobel Prize for
pioneering the field of cellular neuroscience through his research into the
microscopic properties of the brain. He is credited as the founder of
modern neuroscience after discovering the structure of brain cortical
layers, composed of millions of individual cells (neurons), which
communicate via specialized junctions (synapses).
The neocortex constitutes about 85 % of the human brain’s total mass
and is thought to be responsible for the cognitive functions of language,
learning, memory and complex thought. It is also responsible for the
'miracles of thought' making people creative, inventive and philosophical
enough to ask such questions. During the century of neuroscience research,
it has been discovered that the neocortex's neurons are organized into
columns. These cylindrical structures are about 0.5 mm in diameter and
2–4 mm high but pack inside up to 60 000 neurons, each with 10 000
connections to others, producing 5 km of cabling1. In fact, what we call
the 'gray matter' is just a thin surface of neuron bodies that covers the
white matter, the insulated cabling.
Any biological research contributes to technology and science. Looking
at the finesse of living creatures, we admire their beauty so much that
even 200 years after Darwin's publication some cannot believe that an
unintelligent random process under the pressure of natural selection can
generate such perfection [2]. Throughout history, people have drawn
resources from Nature and inspiration from its evolution-optimized
appliances, like wings and silk. The computers were invented as a
byproduct of the attempt to formalize consciousness and computability,
started by Hilbert in the beginning of the 20th century. Von Neumann
derived the computer from the theoretical Turing machine. This machine
executes a prescribed algorithm at speeds as high as 10^9 op/sec. It is
astonishing how much can be done by simple (in terms of complexity theory)
algorithms running automatically. Throughout the 20th century, mankind has
developed communication and information processing technology, entering
the 'information society'. The very idea of evolution was adopted by
computer technology in the forms of OOP and genetic algorithms. Looking
for more advanced computation techniques, mankind has finally resorted to
the secrets of the brain.
Though almighty, biological evolution is useless during the life of its
creatures, since genes act slowly. But the pressure to react immediately
in a rapidly changing environment forced them to create the nervous system
to prompt
1. adequate solutions
2. in real time
3. by analyzing incomplete and contradictory sensory information.
Neural networks (NN) turn out to be good where it is difficult or
impossible to solve by mathematics. Unlike traditional computer methods,
they learn by example rather than solve by algorithm. It is exactly these
adaptive capabilities that outwitted the large-toothed enemies, discovered
and subordinated the forces of nature and, finally, try to understand
themselves.
It is time to take over the most powerful tool in nature, created by
millions of years of evolution. Just as computer science studies combining
logic gates to get a computer, neuroscience studies how the brain is made
of neurons. Its neuroinformatics branch uses mathematical and
computational techniques, such as simulation, to understand the function
of the nervous system. But it is still not clear that the goal is reachable.
1 http://bluebrainproject.epfl.ch/TheNeocorticalColumn.htm
The issue is that, although it gave us the computer, Hilbert's program
on the foundations of mathematics has failed: it was shown that some
'truths', which can be 'seen' by a human, cannot be deduced by a formal
algorithm (a computer). Examples of such true sentences are Turing's
Halting problem and the equivalent Gödel sentence "P = (P cannot be proved)".
Some argue that the brain must necessarily possess a degree of
randomness2 in the struggle for survival, because its purpose is to
'deceive' the enemy whereas predictability means 'defeat'. The random
generators, being capable of transcending the deduction of formal logic,
are also credited as tools of creativity. In [3], this is presented as an
entertaining story, which points to the mind as that very God's stone,
which might be created but cannot be understood. Penrose agrees that
algorithmic computation exists to weed out suboptimal solutions generated
at random. He locates the source of randomness from the standpoint of
theoretical physics [4]. Penrose sees a gap between the micro- and
macro-worlds, considering the mind as a bridge which performs the
irreversible quantum wave-function reduction, thus joining the material
world with the ideal world of Plato. Anybody who starts learning neural
networks and quantum theory notes this similarity in their magic to
traverse the huge problem space quickly3. As one argument, Penrose points
out that ideas, no matter how big they are, are comprehended as holistic
pictures, like the huge ensembles of atoms that consolidate in
pseudo-crystals. Promoting his own TOE4, the prominent scientist attacks
the formalists' 'strong AI' without saying a word about analogue
computation or connectionism. Whoever is right, the tremendous success of
the computer in the 20th century suggests that neuroinformatics research
will be the main challenge of the 21st.
Let us start with the parallel architecture of the brain, which is capable
of recognizing mother's image in 1/100 s while operating at frequencies
as low as 10^3 Hz; that is, in less than 100 steps [5].
elementary computers, the neurons, massively interconnected in some
topology. These structures are distinguished by utmost parallelism in data
storage and processing, by the capabilities to generalize and learn, and
by their tolerance to incomplete and inaccurate learning information.
These unique features make ANNs applicable to classification,
clusterization/categorization (classification without a teacher),
prediction, optimization and control tasks.
y = boolean(s > T ).
Perceptrons
Practical implementation of the model became possible 20 years later.
At the same time, the binary model was extended by Rosenblatt, who
developed perceptrons. The neuron is characterized by a weight vector W
and the form of the activation function f, which produces the activation
value y:
y = f(W · X) = f(s).
The signals and weights are real numbers, and the activation function turns
into a smooth threshold like the sigmoid 1/(1 + e^(−as)) or arctan.
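As an illustration (our own sketch, not code from the surveyed literature), the perceptron neuron y = f(W · X) fits in a few lines of Python:

```python
import math

def sigmoid(s, a=1.0):
    """Smooth threshold 1/(1 + e^(-a*s))."""
    return 1.0 / (1.0 + math.exp(-a * s))

def neuron(weights, inputs, f=sigmoid):
    """y = f(W . X): activation applied to the weighted sum."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return f(s)
```

Replacing `sigmoid` with a hard comparison against a threshold T recovers the binary model y = boolean(s > T).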
It is convenient to consider an NN as a directed graph with weighted edges.
Its neurons, the network nodes, are divided into three groups: input,
output (result) and hidden (intermediate) ones. Loop-less graphs are
called direct-propagation (feed-forward) networks, or perceptrons;
recurrent otherwise.
(a) Artificial neuron. (b) Examples of activation functions.
Many ANN tasks boil down to classification: any input is mapped to a
given set of classes. Geometrically, this corresponds to breaking the
space of solutions (or attributes) into domains by hyperplanes.
Rosenblatt's perceptron consisted of one layer. The hyperplane for a
two-input single-layer perceptron (a neuron) with a steep threshold
function is the line x1 w1 + x2 w2 ≶ T. Bisecting a plane, it allows
implementing the AND and OR binary functions, but not XOR. The XOR
problem, revealing the limited capabilities of perceptrons, was pointed
out by Minsky and Papert [6], who suggested that a 'hidden layer' would
resolve the linear-separability problem.
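The linear-separability limit is easy to check by brute force. The sketch below (a hypothetical illustration) searches a small grid of weights and thresholds for a steep-threshold neuron reproducing a given truth table; AND and OR are found, XOR never is:

```python
import itertools

def steep_neuron(w1, w2, t, x1, x2):
    """Fires iff x1*w1 + x2*w2 exceeds the threshold T."""
    return int(x1 * w1 + x2 * w2 > t)

def realizable(target, grid):
    """True if some (w1, w2, T) from the grid reproduces the truth table."""
    return any(
        all(steep_neuron(w1, w2, t, x1, x2) == y
            for (x1, x2), y in target.items())
        for w1, w2, t in itertools.product(grid, repeat=3))

GRID = [i / 2 for i in range(-4, 5)]            # -2.0 ... 2.0 in 0.5 steps
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

No grid, however fine, will ever realize XOR with a single neuron, since no line separates its true points from its false ones.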
The graph output depends on its topology and synaptic weights. The
procedure of weight adjustment is called learning (or training). In
one-layer perceptron learning, one of the input vectors Xt is submitted to
the network and the output vector Yt is analyzed. At iteration t, all
input-i weights wij of all the neurons j are adjusted according to the
error ∆ = Rt − Yt between Yt and the supervisor-provided reference Rt:
wij(t + 1) = wij(t) + k · ∆j xij, where k ∈ [0, 1] is a learning-speed
factor. The procedure is repeated until the network answers have converged
to the references. Obviously, this technique is not applicable to the
multilayer perceptron, because its hidden-layer outputs are unknown. In
their work, Minsky and Papert conjectured that such learning is
infeasible5, effectively abridging enthusiasm and funding of ANN research
for the next 20 years.
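The one-layer procedure above can be sketched as follows (our own minimal illustration, with a bias input fixed at 1 playing the role of the threshold):

```python
def train_delta(samples, k=0.2, epochs=50):
    """Delta rule: w(t+1) = w(t) + k * (r - y) * x, hard-threshold output."""
    w = [0.0, 0.0, 0.0]                              # w1, w2, bias weight
    for _ in range(epochs):
        for x1, x2, r in samples:
            x = (x1, x2, 1.0)
            y = int(sum(wi * xi for wi, xi in zip(w, x)) > 0)
            w = [wi + k * (r - y) * xi for wi, xi in zip(w, x)]
    return w

def recall(w, x1, x2):
    return int(w[0] * x1 + w[1] * x2 + w[2] > 0)

OR_SAMPLES = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
```

For linearly separable targets such as OR the iteration converges; for XOR it cannot, as discussed above.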
With the invention of backward-propagation learning, multilayer
perceptrons gained popularity. Its essence is gradient-descent
minimization of the network's squared error E = ∑(y_j − d_j)^2. The
weights wij connecting the ith neuron of layer n with the jth neuron of
layer n + 1 are adjusted as ∆wij = −k · ∂E/∂wij. The derivative exploits
the smooth activation function. The algorithm is not free of deficiencies.
High weights may shift the working point of the sigmoids into the
saturation area. Additionally, a high learning speed causes instability;
it is therefore slowed down, and the network stops learning — the gradient
descent is likely trapped in a local minimum here. Another algorithm,
simulated annealing, performs better.
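A minimal backward-propagation sketch in NumPy (our own illustration; the layer sizes, learning rate and epoch count are arbitrary choices, not taken from the text):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_backprop(X, D, hidden=4, k=0.5, epochs=2000, seed=0):
    """Gradient descent on E = sum((y_j - d_j)^2) for a two-layer perceptron."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    for _ in range(epochs):
        H = sigmoid(X @ W1)                # hidden-layer outputs
        Y = sigmoid(H @ W2)                # network outputs
        dY = (Y - D) * Y * (1 - Y)         # output error term (uses smooth f')
        dH = (dY @ W2.T) * H * (1 - H)     # error propagated backwards
        W2 -= k * H.T @ dY                 # delta_w = -k * dE/dw
        W1 -= k * X.T @ dH
    return W1, W2

def predict(W1, W2, X):
    return sigmoid(sigmoid(X @ W1) @ W2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([[0], [1], [1], [0]], float)   # XOR, unreachable for one layer
```

Note the pitfalls mentioned above: with a too-high k the error oscillates, and some random initializations leave the descent in a local minimum.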
Self-Organizing Models
It is also possible to learn without the training information. The
self-organization capability is a cornerstone feature of all living
systems, including nerve cells. As far as ANNs are concerned,
self-organization means adjustment of the weighting factors, the number of
neurons and the network topology. For simplicity, only the weights are
adjusted.
One such approach is Hebbian learning. It reflects a known
neurobiological fact: if two connected neurons are excited simultaneously
and regularly, their connection becomes stronger. Mathematically, this is
expressed as ∆wij = k xij(t) yj(t), where xij = yi and yj are the outputs
of the ith and jth neurons.
Building a one-layer fully recurrent NN, whose size matches the length of
the input vector (object) and whose weights are programmed by the Hebb
algorithm, we get the epochal Hopfield model. With the state initialized
by the input, it is iterated Xt+1 = F(W · Xt) until convergence. The state
is effectively "attracted" to one of the synapse-predefined states,
accomplishing the classification. Basically, the system energy
E = −0.5 · ∑ wij xi xj is minimized. The neural associative memory (NAM)
operates similarly — the input vectors are predefined in pairs with their
outputs.
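Both rules combine into a tiny NumPy sketch (ours, with bipolar ±1 states): Hebbian weights store the patterns, and the Hopfield iteration recalls a stored pattern from a corrupted input.

```python
import numpy as np

def hebb_weights(patterns):
    """Hebbian storage: w_ij = sum over patterns of x_i * x_j, no self-loops."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, x, steps=20):
    """Iterate X(t+1) = sign(W . X(t)) until the state stops changing."""
    for _ in range(steps):
        nxt = np.where(W @ x >= 0, 1, -1)
        if np.array_equal(nxt, x):
            break
        x = nxt
    return x
```

Flipping one bit of a stored pattern and iterating pulls the state back to the synapse-predefined attractor.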
Kohonen self-organizing maps (SOMs) minimize the difference between a
neuron's input and its synaptic weight: ∆wij = k (xi − wij). In contrast
to the Hebbian algorithm, the weights are adjusted not for all neurons,
but only for the group around the neuron responding most strongly to the
input. Such a principle is known as learning by competition. At the
beginning, the neighborhood is set as large as 2/3 of the network and is
shrunk down to a single neuron during the course. This shapes the network
so that close input signals correspond to close neurons, effectively
implementing the categorization task.
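One competitive-learning step on a 1-D map might look like this (our sketch; the neighborhood is a fixed radius here rather than the shrinking 2/3-of-the-network schedule):

```python
import numpy as np

def som_step(W, x, k=0.3, radius=1):
    """Find the best-matching unit, then pull it and its neighbours
    towards the input: delta_w = k * (x - w)."""
    winner = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    lo, hi = max(0, winner - radius), min(len(W), winner + radius + 1)
    W[lo:hi] += k * (x - W[lo:hi])
    return winner
```

Repeated over many inputs while `radius` shrinks, neighboring rows of W come to represent neighboring regions of the input space.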
5 http://ece-www.colorado.edu/~ecen4831/lectures/NNet3.html
2.3 Demand for the Neural HW
The maturity of some NN algorithms and the importance of their intrinsic
non-linearity, contrasted to classical linear approaches, have long been
proven [7]:
Quite to our surprise, connectionist networks consistently
outperformed all other methods received, ranging from visual
predictions to sophisticated noise reduction techniques.
The problem is to find a proper substrate for their execution. Real
applications tend to require large networks and to process many vectors
quickly. Ordinary HW is extremely inefficient here. Furthermore, as the
field of theoretical neuroscience develops and the electrophysiological
evidence accumulates, researchers need ever more efficient computational
tools for the study of neural systems. In cognitive neuroscience and
neuroinformatics, neurosimulation is the essential approach for
understanding complex brain processing. The computational resources
required far exceed those available to researchers.
There is always demand for faster computing. The two approaches to
speeding up are the speed demon, which means waiting for faster
processors, and the brainiac, running many processors simultaneously. The
latter is supported by CMOS technology packing billions of gates into
micro-areas. All that is left to do is learn how to connect them
efficiently. Those who are familiar with parallel architectures know that
the subject is all about scalability. You cannot just take a handful of
fast processors and obtain a linear speedup, since the computation
overhead, along with the synchronization and communication inevitably
involved, limits the performance growth.
That is not the case for the brain, which demonstrates tremendous
scalability — from a few neurons in primitive species up to 100 billion in
humans, at an unprecedented connectivity of 10 000 per neuron. As [8]
points out in his review, the terms 'parallel architectures' and 'neural
networks' are so close that they are often used as synonyms.
Summarizing the classical ANN models presented above, neurosimulation
consists of two phases: 1) in the learning phase, the weights are adjusted
in accordance with input examples; 2) inputs are mapped to outputs during
recall. The neuron processing is simple: a real-valued activation function
is applied to the weighted sum of inputs (a scalar product) and the
resulting activation value is broadcast to the other nodes. Basically, two
simple operations are needed: multiply-and-accumulate and the activation
function F(W · X). Nevertheless, owing to the large number of neurons,
which are massively interconnected at that, the workload in the recall
phase ends up quite involved. The learning phase is even more burdensome.
However, all the synapses and neurons operate independently.
The classical ANN models suggest that information and computation are
uniformly distributed over the network in such a way that the stored
objects are memorized in the synapses, so that every synapse bears
information on all the memorized objects. This utmost diffusion of
information is opposed to the unambiguity pursued in classical (mechanic,
symbolic) algorithms and data structures. This revolutionary concept, the
highest degree of distribution, is defined as connectionism. Confining all
the computation right into the connections would result in truly wired
logic.
The inherently parallel neural models can make the most of highly
parallel machines. However, the redundancies and the simulation of neural
operations, instead of their implementation in HW, make general-purpose
supercomputers expensive and slow. The HW is fast and efficient when it is
'in line' with the model it executes. The degree of parallelism inherent
in neuroprocessing invites parallel execution. Neuroscience inspires
computer scientists to look for an optimal substrate for the neural models
— fast and efficient parallel HW architectures. Massively parallel VLSI
implementations of NNs would combine the high performance of the former
with the good scalability of the latter.
Chapter 3
Traditional Approach
Figure 3.1: Taxonomy of neural platforms
Figure 3.2: Mapping an ANN (guest graph) onto a parallel machine (host)
[9]
3.3 Benchmarking
Performance measurements play a key role in deciding about the
applicability of a neuroimplementation. Yet, because of the field's
immaturity, there is no standard benchmarking application like TeX and
Spice in ordinary computing. Only a few exceptions exist, like NETtalk,
and they are sometimes used for comparisons. The most commonly accepted
measures are CPS (Connections Per Second), which measures how fast a
network performs the recall phase, and CUPS (Connection Updates Per
Second), which measures the speed of learning. Implementations are
compared against an Alpha workstation and alternative designs. This
measure is blamed [7] (EPFL), however, to be even more deceptive than
FLOPS, for two reasons:
• it defines neither the network model nor its size and precision.
A more complex neuron can replace many simpler ones and prolong
data lifetime on a PE, considerably reducing the communication, which
is crucial for I/O-bound NNs. The missing benchmark application
allows the developers to choose a best-case network and misses any
information on the capability of the structure to adapt to different
problems.
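The metric itself is simple arithmetic, e.g. (a hypothetical calculation using the NETtalk-like layer structure 203-80-26):

```python
def connections(layers):
    """Weights in a fully connected feed-forward net with the given layer sizes."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

def cps(n_connections, n_vectors, seconds):
    """Connections Per Second: connections evaluated per second of recall."""
    return n_connections * n_vectors / seconds

NETTALK = [203, 80, 26]      # illustrative layer sizes
```

Recalling 1000 input vectors in 0.1 s through the resulting 18 320 connections would rate about 183 MCPS; CUPS is computed the same way over weight updates in the learning phase. The formula shows exactly the weakness noted above: nothing in the number fixes the model, the precision, or the problem.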
(a) A Ring Architecture (b) A Bus Architecture
put to the next node for the accumulation. Once all the sums are computed,
the nodes apply the activation function and start a new round.
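The ring scheme can be simulated in a few lines (our sketch): each node holds one neuron's weight row, and the activation values rotate one hop per step while partial sums accumulate.

```python
import numpy as np

def ring_matvec(W, x):
    """Systolic-ring matrix-vector product: after n rotations every
    node i has accumulated row i of W . x."""
    n = len(x)
    acc = np.zeros(n)
    vals = x.copy()            # vals[i]: value currently held by node i
    idx = np.arange(n)         # original index of that value
    rows = np.arange(n)
    for _ in range(n):
        acc += W[rows, idx] * vals   # each node consumes the value it holds
        vals = np.roll(vals, 1)      # pass values to the next node
        idx = np.roll(idx, 1)
    return acc
```

Only nearest-neighbor transfers are needed, which is why the ring is a popular layer-parallel mapping.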
Control-parallel architectures perform processing in a decentralized man-
ner, allowing different programs to be executed on different processors
(MIMD). A parallel program is explicitly divided into several different tasks
which are placed on different processors. The communication schema is
usually general routing, i.e. the processors are message-passing computers.
Transputers are the most popular platform for control-parallel neural simulations.
Writing this paper for a course on parallel architectures, it is curious
to note that [8] identifies data-parallel decomposition with the SIMD
architecture and the multiprocessor. This 'equivalence class' is opposed
to another consisting of parallelized control, the MIMD architecture and
the message-passing multicomputer. The author lists some neural
simulations on general-purpose computers. The summary is presented in
table 3.1. The author concludes that data-parallel techniques
"significantly outperform their control-parallel counterparts". In my
opinion, it is not fair to compare the power of a group of six transputers
against kilo-processor armies. Admitting the linear scalability of
transputers, the data shows that they are an order of magnitude faster.
Yet, from a theoretical point of view, it is reasonable to think that
data-parallel architectures are a natural mapping of the neuroparadigm,
since neural computations are most often interpreted in terms of
synchronous matrix-vector operations. For this reason, it is not
surprising that control-parallel architectures are programmed in
data-parallel style. [9] explains that MIMD is used perforce, because
production of SIMD machines stopped a long time ago.
Curiously, [12] finds the Beowulf cluster the most attractive in his
review of simulating NNs on parallel general-purpose computers. Equipped
with a high-speed interconnection network such as Myrinet, Beowulf offers
excellent performance at a very competitive price. This cost advantage can
often be as high as an order of magnitude over multiprocessor machines of
comparable capabilities.
Programming neural networks on parallel machines requires high-level
techniques reflecting both the inherent features of neuromodels and the
characteristics of the underlying computers. To simplify the task of the
neuroscientist, a number of parallel neurosimulators were proposed for
general-purpose machines. Some institutions develop libraries for MIMD
supercomputers that enable NN developers to use the supercomputer
efficiently without specific knowledge [9]. Others develop professional
portable neurosimulators, like NEURON and NCS. The compromise between
portability and efficiency is usually achieved by parallel programming
environments, e.g. Message-Passing Interface (MPI), Parallel Virtual
Machine (PVM), Pthreads and OpenMP, on heterogeneous and homogeneous
clusters and multiprocessors.
Simulations on general-purpose parallel computers were mostly done in the
late eighties. A large number of parallel neural network implementation
studies have been carried out on the existing massively parallel machines
listed below, simply because neural hardware was not available. Although
these machines were not specially designed for neural implementations, in
many cases very high performance rates have been obtained. Universal
computers still remain popular in neurocomputing because they are more
flexible and easier to program.
Zhang et al. [13] have used node-per-layer and training parallelism to
implement backpropagation networks on the Connection Machine. Each
processor is used to store a node from each of the layers, so that a
'slice' of nodes lies on a single processor. The number of processors
needed to store a network is equal to the number of nodes in the largest
layer of the network. The weights are stored in a memory structure shared
by a group of 32 processors, reflecting the CM's specific architecture.
With 64 K processors, the CM is a perfect candidate for training-example
parallelism. The authors use network replication to fully utilize the
machine. The NETtalk implementation achieves a peak performance of
180 MCPS and 38 MCUPS.
The MasPar implementation [14], similarly to Zhang's implementation,
exploits both layer and training-session parallelism. Each processor
stores the weights of the corresponding neurons in its local memory. In
the forward phase, the weighted sums are evaluated, with intermediate
results rotated from right to left across the processor array using
MasPar's local interconnect. Once the input values have been evaluated,
sigmoid activation functions are applied and the same procedure is
repeated for the next layer. In the backward phase, a similar procedure is
performed, with errors propagated from the output down to the input layer.
After performing a number of training examples on multiple copies of the
same network, weights are synchronously updated. The maximal NETtalk
performance obtained is 176 MCPS and 42 MCUPS.
Rosenberg and Blelloch [15] have used node and weight parallelism to
implement backpropagation networks on a one-dimensional array, with one
processor allocated to a node and two processors to each connection: input
and output. Connection processors multiply the values by their respective
weights. The nodes accumulate the products and compute sigmoids. The
backpropagation is done in a similar way. The maximum NETtalk speed
achieves 13 MCUPS.
Pomerleau et al. [16] have used training and layer parallelism to
implement backpropagation networks on a Warp computer with processors
organized in a systolic ring. In the forward phase, the activation values
are shifted circularly along the ring and multiplied by the corresponding
weights. Each processor accumulates the partial weighted sum. When the sum
has been evaluated, the activation function is performed. Backward
processing is similar, but instead of activation values, accumulated
errors are shifted circularly. Performance measurements for the NETtalk
application are summarized in table 3.1.
Table 3.1: Neural simulations on general-purpose parallel computers

Structuring  Parallelism      Procs  Computer architecture               CPS    CUPS
COARSE       training, layer  64K    Connection Machine (Zhang 90)       180M   38M
COARSE       training, layer  16K    MasPar (Zell 90)                    176M   42M
FINE         node, weight     64K    Connection Machine (Rosenberg 87)          13M
PIPELINED    training, layer  10     Warp (Pomerleau 88)                        17M
PIPELINED    layer, node      13K    Systolic Array (Chung 92)                  148M
COARSE       partitions       6      Transputers (Straub 91)                    207K
Figure 3.3: The fractal topology and MindShape architecture: the similar
module interconnect pattern at all levels of hierarchy
3.5 Neurocomputers
Despite the advances, the speed and efficiency requirements cannot be
successfully met by general-purpose parallel computers. General-purpose
neuroarchitectures offer 'generic' neural features aimed at a wide range
of ANN models. Neurocomputers can be further specialized for simulating
concrete models and networks.
Architecturally, neurocomputers are large processor arrays — complex
regular VLSI architectures organized in a data-parallel manner. A typical
processing unit of a neurocomputer has local memory for storing weights
and state information. The whole system is interconnected with a paral-
lel broadcast bus, and usually has a central control unit. The data-parallel
programming techniques and HW architectures are most efficient for neu-
ral processing. The dominating approaches are: systolic arrays, SIMD and
SPMD processor arrays.
Important for the design of highly scalable hardware, finding an in-
terconnection strategy for large numbers of processors has turned out to
be a non-trivial problem. Much knowledge about the architectures of these
massively parallel computers can be directly applied in the design of neural
architectures. Most architectures are however ‘regular’, for instance grid-
based, ring-based, etc. Only a few are hierarchical. As was argued in [11],
the latter forms the most brain-like architecture.
Analog architectures tend toward full connectivity. Digital chips use a
localized communication plan. Three architectural classes of system
interconnect can be distinguished: systolic, broadcast-bus, and
hierarchical architectures. Systolic arrays are considered non-scalable.
According to many designers, broadcasting is the most efficient
multiplexed interconnection architecture for large fan-in and fan-out. It
seems that broadcast communication is often the key to success in
balancing communication and processing, since it is a way to time-share
communication paths efficiently.
2D architectures are less modular and less reconfigurable as the data
flow is quite rigid. At the same time, they allow throughputs much higher
than 1D architectures.
Implementing neural functions on special-purpose chips speeds the neural
iteration time up by about two orders of magnitude compared to a
general-purpose µP. The common goal of neurochip designers is to pack as
many processing elements as possible into a single silicon chip, thus
providing faster connectivity. To achieve this, developers limit the
computation precision. [11] remarks that overfocusing on this pushes aside
the issue of inter-chip connectivity, which is also important for
integration into a large-scale architecture.
Digital technology has produced the most mature neurochips, providing
flexibility (programmability) and reliability (stable precision compared
to analog) at relatively low cost. Furthermore, due to mass production, a
lot of powerful tools for custom design are available. Numerous programs
for digital neurochip design are offered; all major microchip companies
and research centers world-wide have announced their neuroproducts.
Digital implementations use thousands of transistors to implement a single
neuron or synapse.
On the other hand, these computationally intensive calculations can be
performed automatically by analog physical processes, such as the summing
of currents or charges. Operational amplifiers, for instance, are easily
built from single transistors and automatically perform synapse- and
neuron-like functions, such as integration and sigmoid transfer. Being
natural, analog chips are very compact and offer high speed at low energy
dissipation. Simple neural (non-learning) associative memory chips with
more than 1000 neurons and 1000 inputs each can be integrated on a single
chip performing about 100 GCPS [11]. Another advantage is the ease of
integration with the real world, while digital counterparts need AD/DA
converters.
The first reason why analog did not replace digital chips is a lack of
flexibility: analog technology is usually dedicated to one model and
results in a scarcely-usable neurocomputer. Another problem, representing
adaptable weights, limits the applicability of analog circuits. Weights
can, for instance, be represented by resistors, but such fixing of the
weights in the production of the chips makes them non-adaptable — they can
only be used in the recall phase. Capacitors suffer from limited storage
time and troublesome learning. Off-chip training is sometimes used, with
refreshing of the analog memory. For on-chip training, statistical
methods, like random weight changing, are proposed in place of
back-propagation, whose complex computation and non-local information make
it prohibitive. Other memory techniques are incompatible with standard
VLSI technology.
In addition to the weight storage problem, analog electronics is susceptible to temperature changes, (interference) noise, and VLSI process variations. These make analog chips less accurate, make it harder to understand what exactly is computed, and complicate design and debugging. At the same time, practice shows that realistic neuroapplications often require accurate calculations, especially for back-propagation.
Taking these drawbacks into account, the optimal solution appears to be a combination of analog and digital techniques. Hybrid technology exploits the advantages of both approaches. The optimal combination applies digital techniques to perform accurate and flexible training, and uses the density of analog chips to obtain finer parallelism on a smaller area in the recall phase.
Here, we do not consider optical technology, which introduces photons as the basic information carriers. They are much faster than electrons and have fewer interference problems. In addition to the greater potential communication bandwidth, the processing of light beams also offers massive parallelism. These features put optical computing first among the candidates for the neurocomputer of the future. Optics ideally suits the realization of dense networks of weighted interconnections; spatial optics offers 3-D interconnection networks with enormous bandwidth and very low power consumption. Besides optoelectronics, electro-chemical and molecular technologies are also very promising. Yet despite the enormous parallel-processing and 3-D connection prospects of optical technology, silicon continues to dominate, with more and more neurons being packed on a chip.
Neurocomputers are popular in the form of accelerator boards added to personal computers.
The CNAPS system (Connected Network of Adaptive Processors, 1991) developed by Adaptive Solutions became one of the best-known commercially available neurocomputers. It is built of N6400 neurochips, each of which consists of 64 processing nodes (PN) connected by a broadcast bus in SIMD mode. Two 8-bit buses allow broadcasting of input and output data to all PNs and easy addition of more chips. Additionally, the buses connect the PNs to a common instruction sequencer.
The PNs are designed like DSPs, including a fixed-point MAC, and are equipped with 4 KB of local SRAM for holding the weights: one matrix for recall and one for back-propagation learning. This limits the system size: the performance drops dramatically when 64 PNs try to communicate over the two buses, which becomes necessary when the network and weight matrix grow. The complete CNAPS system may have 512 nodes connected to a host workstation and includes SW support. It uses layer decomposition and offers a maximum performance of 5.7 GCPS and 1.46 GCUPS tested on a back-propagation network. The machine can be used as a general-purpose accelerator.
The SYNAPSE system (Synthesis of Neural Algorithms on a Parallel Systolic Engine) was built by Siemens in 1993 of MA-16 neurochips designed for fast 4×4 matrix operations with 16-bit fixed-point precision. The chips are cascaded to form a systolic array: one MA-16 chip outputs to another in a pipelined manner, ensuring optimal throughput. The two parallel rings of SYNAPSE-1 are controlled by Motorola processors. The weights are stored off-chip in 128 MB SDRAM. Similarly to CNAPS, a wide range of NN models is supported but, in contrast to SIMD, programming is difficult because of the complex PEs and the 2-D systolic structure. The system is packaged with SW to make neuroprogramming easier. Each chip's throughput is 500 MCPS; the full system performs at 5.12 GCPS and 33 MCUPS.
The RAP system, developed at Berkeley in 1993, is a ring array of DSP chips specialized for fast dot-product arithmetic. Each DSP has local memory (256 KB of static RAM and 4 MB of dynamic RAM) and a ring interface. Four DSPs can be packed on a board, with a maximum of ten boards. Each board has a VME bus interface to the host workstation. Processing is performed in an SPMD manner: several neurons are mapped onto a single DSP in layer-decomposition style. The maximum speed of a 10-board system is estimated at 574 MCPS and 106 MCUPS.
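The layer-decomposition mapping used here can be sketched in a few lines: each processor owns a slice of a layer's neurons and computes their dot products independently, while the input vector is broadcast to all of them. This is an illustrative sketch of the mapping style only, not the RAP code; the function name and the NumPy emulation of the processors are our own.

```python
import numpy as np

def layer_forward_decomposed(W, x, n_proc):
    # Layer decomposition: each "processor" owns a contiguous slice of
    # the layer's neurons (rows of the weight matrix W) and computes
    # their dot products independently; the input x is broadcast.
    slices = np.array_split(np.arange(W.shape[0]), n_proc)
    partial = [W[idx] @ x for idx in slices]   # one entry per processor
    return np.concatenate(partial)             # the ring/bus gathers results
```

The result is identical to the undistributed matrix-vector product; only the work is partitioned.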
The SAIC SIGMA-1 neurocomputer is a PC with a DELTA floating-point processor board and two software packages: an object-oriented language and a neural-net library. The coprocessor can hold 3 M virtual processing elements and connections, performing 2 MCUPS and 11 MCPS.
The Balboa 860 co-processor board for PCs and Sun workstations is intended to enhance the neurosoftware package ExploreNet. It uses an Intel i860 as the central processor and reaches a maximum speed of 25 MCPS for a back-propagation network in the recall phase, and 9 MCUPS in the learning phase.
The Lneuro (Learning Neurochip, 1990) by Philips implements 32 input and 16 output neurons. By updating the whole set of synaptic weights related to a given output neuron in parallel, a sort of weight parallelism is reached. The chip provides on-chip learning with an adjustable learning rule. A number of chips can be cascaded within a reconfigurable, transputer-controlled network. Experiments with 16 LNeuro 1.0 chips report an 8× speed-up compared to an implementation on a transputer. Measured performance: 16 LNeuros on 4 dedicated boards show 19 MCPS and 4.2 MCUPS. The authors guarantee a linear speed-up with the size of the machine.
(a) CNAPS made of N6400 chips (b) SYNAPSE (c) MANTRA: systolic array of GENES IV chips (d) MANTRA: GENES IV processing element
Mantra I (1993, Swiss Federal Institute of Technology) aims at a multi-model neural computer which supports several types of networks and paradigms. It consists of a 2-D array of up to 40×40 GENES IV systolic processors and a linear array of auxiliary processors called GACD1. The GENES chips (Generic Element for Neuro-Emulator Systolic arrays) are bit-serial processing elements that perform vector/matrix multiplications. The Mantra architecture is in principle very well scalable, and it is one of the rare examples of synaptic parallelism. It shares the difficult reconfigurability and programming with SYNAPSE; the slow controller and serial communication limit the performance. Performance: 400 MCPS and 133 MCUPS (back-propagation) with 1600 PEs.
BACHUS III (1994, Darmstadt University of Technology and Univ. of Düsseldorf, Germany) is a chip containing the functionality of 32 neurons with 1-bit connections. The chips are mounted together, resulting in 256 simple processors; the total system was called PAN IV. The chips are only used in the feed-forward phase; learning or programming is not supported and thus has to be done off-chip. The system only supports neural networks with binary weights. Applications are to be found in fast associative databases in a multi-user environment, speech processing, etc.
The analog Mod2 neurocomputer (Naval Air Warfare Center Weapons Division, CA, 1992) incorporates neural networks as subsystems in a layered hierarchical structure. Mod2 is designed to support parallel processing of image data at sensor (real-time) rates. The architecture was inspired by the structures of the biological olfactory, auditory, and visual systems. The basic structure is a hierarchy of locally densely connected, globally sparsely connected networks; the locally densely interconnected network is implemented in a modular/block structure based upon the ETANN chip. Mod2 is said to implement several neural network paradigms and is in theory infinitely extensible. An initial implementation consists of 12 ETANN chips, each able to perform 1.2 GCPS.
Epsilon (Edinburgh Pulse Stream Implementation of a Learning Oriented Network, 1992), developed at Edinburgh University, is a hybrid large-scale generic building-block device. It consists of 30 nodes and 3600 synaptic weights, and can be used both as an accelerator to a conventional computer and as an autonomous processor. The chip has a single layer of weights but can be cascaded to form larger networks. The synapses are formed by transconductance multiplier circuits which generate output currents proportional to the product of two input voltages; a weight is represented by fixing one of these voltages. The chip supports two modes of operation. The synchronous mode uses pulse-width modulation and is specially designed with vision applications in mind. The asynchronous mode is provided by pulse-frequency modulation, which is advantageous for feedback and recurrent networks, where temporal characteristics are important. The synchronous implementation was successfully applied to a vowel recognition task: an MLP network of 38 neurons (hidden and output) was trained by the 'chip-in-loop' method and showed performance comparable to a software simulation on a SPARCstation. This chip has shown that robust and reliable networks can be implemented using the pulse-stream technique. Performance: 360 MCPS.
3.6 FPGAs
Massively parallel and reconfigurable FPGAs are well suited to implementing the highly parallel and dynamically adaptable ANNs. In addition, being general-purpose computing devices, FPGAs offer enough flexibility for many neuromodels and are also useful for conventional pre- and post-processing of the interface around the network.
However, despite the custom-chip fine-grain parallelism offered, FPGAs are not true digital VLSI: they are one order of magnitude slower. Yet the newest FPGAs incorporate ASIC multipliers and MAC units that have a considerable effect in the multiplication-rich ANNs. Floating-point operations are impractical; in particular, the non-linear activation (sigmoid) function, which is too expensive in direct implementation, is usually approximated piecewise linearly.
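As an illustration of piecewise-linear approximation, here is a sketch in the style of the published PLAN scheme, in which all segment slopes are powers of two so that the multiplications reduce to shifts in hardware. The breakpoints below come from that scheme, not from [19], and are given in floating point only for readability.

```python
def sigmoid_pwl(x: float) -> float:
    # PLAN-style piecewise-linear sigmoid: all segment slopes are
    # powers of two, so in fixed-point HW the multiplications become
    # shifts. Only |x| is processed; symmetry gives the negative side.
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625
    else:
        y = 0.25 * ax + 0.5
    return y if x >= 0 else 1.0 - y
```

The approximation error stays within a few percent of the true logistic function, which is usually tolerable in the recall phase.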
The reconfigurability permits neural morphing: during training, the topology and the required computational precision of an ANN can be adjusted according to some learning criteria. The review [19] refers to two works that used genetic algorithms to dynamically grow and evolve ANN-based cellular automata and implemented an algorithm which supports on-line pruning and construction of network models. The reviewers mention the need for friendlier learning algorithms and software tools.
Below are some examples of implementing different models on the RAPTOR2000 board. Thanks to its flexibility, many other neural and conventional algorithms can be mapped onto the system and reconfigured at runtime.
(a) The prototyping board (b) SOM architecture
Manhattan distances are used instead of Euclidean ones to avoid multiplications and square roots. An interesting trick is demonstrated: the PEs start learning in a fast 8-bit precision configuration for rough ordering of the map, and are then reconfigured to the slower 16-bit precision for fine-tuning. The number of cycles per input vector depends on the input vector length l and the number of neurons per PE n: c_recall = n · (l + 2·⌈ld(l · 255)⌉ + 4), and is almost twice that for learning. At the achieved 65 MHz clock rate, the XCV812E-6 outperforms an 800 MHz AMD Athlon by more than 30 times.
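In the cycle-count formula, ld denotes the base-2 logarithm and ⌈·⌉ the ceiling. A one-line helper (our own, simply evaluating the published formula) makes it easy to see how the cost scales:

```python
import math

def som_recall_cycles(l: int, n: int) -> int:
    # c_recall = n * (l + 2*ceil(log2(l*255)) + 4); learning takes
    # almost twice as many cycles per input vector.
    return n * (l + 2 * math.ceil(math.log2(l * 255)) + 4)
```

For example, a 16-element input vector with 4 neurons per PE gives 176 cycles per recall; the cost is linear in the number of neurons per PE.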
Another application is a binary neural associative memory. Using sparse encoding, i.e., when almost all bits of the input and output vectors are '0', the best storage efficiency and almost linear scalability are achievable for both recall and learning. Every processor works on its own part of the neurons, but large storage is required. More than a million associations can be stored on six Virtex modules (512 neurons per FPGA) using external SDRAM. Every FPGA has a 512-bit connection to the SDRAM bus, and every neuron processes one column of the memory matrix. This 50 MHz implementation is limited by the SDRAM access time and results in a 5.4 µs recall.
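The report does not spell out the algorithm, but binary associative memories of this kind are commonly built Willshaw-style: learning ORs the outer product of each sparse pair into a binary weight matrix, and recall thresholds the column sums. A minimal software sketch under that assumption (every neuron corresponds to one column of the matrix, as in the FPGA mapping above):

```python
import numpy as np

def train(pairs, n_in, n_out):
    # A synapse is set once an input bit and an output bit are active
    # together in some stored pair (binary Hebbian learning).
    W = np.zeros((n_in, n_out), dtype=bool)
    for x, y in pairs:
        W |= np.outer(x, y).astype(bool)
    return W

def recall(W, x):
    # An output neuron fires when every active input bit is connected
    # to it: the threshold equals the number of '1's in the query.
    s = x.astype(int) @ W.astype(int)
    return (s >= x.sum()).astype(int)
```

With sparse vectors the matrix fills slowly, which is what gives the high storage efficiency mentioned above.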
The last sample application is (radial basis) function approximation. A network with a flexible number of hidden neurons is trained incrementally: if a good approximation is not achieved, a neuron is added and learning restarts. This also minimizes the risk of a local minimum. A number of identical PEs compute their neurons in parallel. Data selectors assign the inputs and select the correct outputs, which are summed up in a global accumulator. Simultaneously, an error calculation unit analyzes the error, which is submitted to the controller and the PEs for weight update. Such an implementation can run at 50 MHz with the number of cycles per recall given by: c = l + N_PE + ⌈N_neur / N_PE⌉ · ((4·l + 5) + 2).
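The incremental scheme can be sketched in software as follows. This is a 1-D toy version under our own simplifications: the Gaussian width is fixed, the new neuron is centred on the worst-approximated sample, and the output weights are refitted by least squares instead of a full learning restart.

```python
import numpy as np

def gauss(X, centers, width):
    # Hidden layer: one Gaussian radial basis function per neuron.
    d = np.abs(X[:, None] - centers[None, :])
    return np.exp(-(d / width) ** 2)

def predict(X, centers, width, w):
    if len(w) == 0:
        return np.zeros(len(X))
    return gauss(X, np.asarray(centers), width) @ w

def train_incremental(X, y, width=0.15, tol=0.05):
    # If the approximation is not good enough, add a hidden neuron
    # centred on the worst-approximated sample and refit the output
    # weights; repeat until the maximum error drops below tol.
    centers, used, w = [], set(), np.zeros(0)
    for _ in range(len(X)):
        err = np.abs(y - predict(X, centers, width, w))
        if err.max() < tol:
            break
        i = int(next(j for j in np.argsort(-err) if j not in used))
        used.add(i)
        centers.append(X[i])
        H = gauss(X, np.asarray(centers), width)
        w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return np.asarray(centers), w
```

Because a neuron is added exactly where the current error is largest, the network grows only as far as the target function demands.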
3.7 Conclusions
During the past five decades, the most frequently used types of artificial neural networks have been the perceptron-based models. Implementation projects such as those reported above are giving rise to new insights, insights that would most likely never have emerged from simulation studies. However, the neurocomputers did not uncover the full potential for fast, scalable and user-friendly neurosimulation.
In the late 1980s and early 1990s, neurocomputers based on digital neurochips reached the peak of their popularity; some even left the research laboratories and entered the market. However, the progress stalled and the initial enthusiasm decreased because 1) user experience in solving real-world problems was not very satisfactory, and 2) they had to compete with the general-purpose µP that grew according to Moore's Law.

Figure 3.6: Neurocomputer Performances [11]

Development of custom chips1 is very expensive (and especially hard for the connectionists, who are not familiar with such things as VHDL), the chips are less programmable, and it turns out better to rely on the massively produced and exponentially growing general-purpose µP. DSPs are especially popular because of their highly parallel SIMD-style MACs for the synapses and a tightly integrated FPU for sigmoid computation. The relatively new FPGAs, which are also general-purpose computation devices, surpass their performance by one order of magnitude. These implementations will, of course, never be as efficient and fast as dedicated chips.
As Figure 3.6 shows, neurocomputers are two orders of magnitude faster than general-purpose multiprocessors. Neurocomputers made of neurochips are an additional two orders of magnitude faster than their µP-based counterparts, and analog technology brings two more orders.
Later works show that the analog inaccuracy can be an advantage in the 'inherently fault-tolerant' neurocomputing, remarking that the 'wetware' of real brains keeps working surprisingly well over a wide range of temperatures and a variety of neurons. But recall that CPS is a vague figure. It was realized that the wetware which constitutes the animal brains uses more powerful spike-based models. The trend of the last decade has been to move to the spiking neural networks [principles of designs for large-scale], which are much more powerful yet allow simulating more neurons with less HW.
1 [22] justifies custom design only 1) for large system solutions and 2) when topological and computational model flexibility is provided through a simple user description.
Chapter 4
Spiking NNs
Figure 4.1: The Action Potential and Spiking NN
4.2 Sample HW
4.2.1 Learning at the Edge of Chaos
Any internal dynamics emerges from the collective behavior of the interacting neurons; it is a product of the neuron coupling. As the authors of [21] mention, it is therefore necessary to study the coupling factor: the average influence of one neuron upon another.
They experiment with STDP by building a video-processing robot which learns to avoid obstacles (walls and moving objects), with the purpose of investigating the internal dynamics of the network. A Khepera robot with a linear (1-D horizon vision) camera and collision sensors is used for the experiment. The video image is averaged down to 16 pixels, which are fed to 16 input neurons, processed by 40 fully recurrent hidden neurons and two output neurons that control the two motors. The weights of the fully recurrent network are initialized randomly with some variance around the center of a normal distribution.
Full-black color is encoded by 10 Hz spikes; full-white corresponds to 100 Hz. On average, an input spike happens every 100 steps; meanwhile, the network keeps firing and learning. The STDP learning factor is α = ±const, depending on whether the robot moves or hits a wall, and 0 otherwise. Starting from chaotic firing, the neurons synchronize with each other and with the external world.
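A minimal sketch of the weight update implied here: a pair-based STDP rule gated by the global factor α. The exponential timing windows and the amplitude/time constants are our own illustrative assumptions, not values from [21].

```python
import math

def stdp_dw(t_pre, t_post, alpha, a_plus=0.1, a_minus=0.12, tau=20.0):
    # Pair-based STDP: potentiate when the presynaptic spike precedes
    # the postsynaptic one, depress otherwise. The whole update is
    # scaled by the global factor alpha (+const while the robot moves,
    # -const on a collision, 0 otherwise).
    dt = t_post - t_pre
    if dt >= 0:
        return alpha * a_plus * math.exp(-dt / tau)
    return -alpha * a_minus * math.exp(dt / tau)
```

With a negative α (collision) the rule inverts, so spike pairings that led to the collision are weakened rather than strengthened.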
However, fast synchronization is not always good. The network must exhibit two contradictory dynamic features: plasticity, to remain responsive, and "autism", to maintain the stability of the internal dynamics, especially in a noisy environment. The authors look at the average membrane potential developing in time: m(t) = (1/N)·Σ_i V_i(t). The coupling must be high enough to avoid 'neural death'; then the dynamics evolves from the initial chaotic firing to a synchronous mode. At weak coupling, this measure is chaotic: the neurons fire asynchronously and aperiodically, and the robot behaves almost randomly. Increased variance favors increased periodicity.
Figure 4.2: NeuroPipe-Chip on MASPINN board
• a spike event list that models the axon delays: it stores the spike source neurons along with the spike times2. The fact that the next time slot is computed from the data of the previous one allows the network to be processed in parallel.
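The event-list idea can be sketched as a priority queue of spike arrivals; only the neurons actually hit by a spike are touched, which is what makes event-driven simulation cheap at low network activity. This is a generic sketch (leakage and refractoriness omitted for brevity), not the NeuroPipe-Chip's actual pipeline:

```python
import heapq

def simulate(inputs, synapses, threshold=1.0, t_end=100.0):
    # Event-driven SNN sketch. `inputs` holds external spike arrivals
    # (time, neuron, weight); `synapses[src]` lists (dst, weight, delay).
    q = list(inputs)
    heapq.heapify(q)                  # the spike event list, ordered by time
    v, spikes = {}, []
    while q:
        t, n, w = heapq.heappop(q)
        if t > t_end:
            break
        v[n] = v.get(n, 0.0) + w      # integrate (no leak, for brevity)
        if v[n] >= threshold:
            v[n] = 0.0                # fire and reset
            spikes.append((t, n))
            for dst, weight, delay in synapses.get(n, []):
                # schedule the delayed arrival at each target neuron
                heapq.heappush(q, (t + delay, dst, weight))
    return spikes
```

A single external spike into a two-neuron chain, for instance, propagates through the queue with the axon delay applied between the two firings.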
Num of neurons   Alpha 500 MHz   SPIKE 128k (10 MHz FPGA)   ParSpike (100 MHz, 64 DSPs)   MASPINN (100 MHz NeuroPipe-Chip)
1K               0.56 ms         1 ms                       1 ms                          6.5 µs
128K             67 ms           10 ms                      1 ms                          0.83 ms
1M               650 ms          —                          8 ms                          6.5 ms
Figure 4.3: SNN on analog VLSI. An axon drives a row; a column of synapses feeds the neuron at the column bottom. The neuron's spikes are converted into an axon current, which drives a row.
science, we see that Maass started by studying the hard, symbolic automata and Turing machines, then moved to analog and neural networks, approaching Markram's brain research. Together they have developed the 'liquid state machine' (LSM), a kind of SNN, to be used in real-world applications. This is needed for the following reasons.
The authors point out that computer science lacks a universal model of the organization of computations in cortical microcircuits, which are capable of carrying out potentially universal information processing tasks [24]. The adopted universal computers, the Turing machines and attractor ANNs, are inapplicable because they process static discrete inputs, whereas the neural microcircuits carry out computations on continuous streams of inputs. The conventional computers keep the state in a number of bits and are intended for static analysis: you record the input data i and revise this history in order to compute the output at the current time t: o(t) = O(i_1, i_2, ..., i_t). This costs HW and time. However, in the real world there is no time to wait until a computation has converged: results are needed instantly (anytime computing) or within a short time window (real-time computing). The computations in common computational models are partitioned into discrete steps, each of which requires convergence to some stable internal state, whereas the dynamics of cortical microcircuits appears to be continuously changing (the only stable state is the 'dead state'). The biological data suggest that cortical microcircuits may process many tasks in parallel, while most NN models are incompatible with this pattern-parallelism. Finally, the components of biological neural microcircuits, the neurons and synapses, are highly diverse and exhibit complex dynamical responses on several temporal scales, which makes them completely unsuitable as building blocks of computational models that require simple uniform components, such as virtually all models inspired by computer science or ANNs. These observations motivated the authors to look for an alternative organization of computations, calling this 'a key challenge of neural modeling'. The proposed framework is not just compatible with the aforementioned constraints, it requires them.
Every new neuron adds a degree of freedom to the network, making its dynamics very complicated. The conventional approaches are, therefore, to keep the (chaotic) high-dimensional dynamics under control or to work only with the stable (attractor) states of the system. This eliminates the inherent ability of NNs to continuously absorb information about the inputs as a function of time. This gives an idea for explaining 'how a continuous stream of multi-modal input from a rapidly changing environment can be processed by stereotypical recurrent integrate-and-fire neuron circuits in real-time'.
The authors look at a NN as a 'liquid' whose dynamics (state) is 'perturbed' by the inputs. All the temporal aspects of the input data are digested into the high-dimensional liquid state.

Figure 4.4: The Liquid State Computing

The desired function output is 'read out' from the (literally) current state of the liquid by another NN. That is all: the high-dimensional dynamical system formed by the neural liquid serves as a universal source of information about past stimuli for the readout neurons, which extract the particular aspects needed for diverse tasks in real-time.
Owing to the fact that

1. the liquid is fixed: its connections and synaptic weights are randomly predefined; and

2. the only part that learns is the readout, which is memory-less (it relies only on the current state of the liquid, ignoring any previous states) and can thus be as simple as a 1-layer perceptron
Figure 4.5: Hard Liquid. a) A structural block. b) The interpolated Memory Capacity for different weight distributions (the points); the largest 3 distributions are highlighted.
inputs.
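The scheme of fig. 4.4 can be sketched in a few lines. For brevity this sketch uses rate (tanh) neurons, i.e., the echo-state cousin of the spiking LSM, but the division of labor is the same: a fixed random recurrent 'liquid' perturbed by the input stream, and a memory-less linear readout trained on the instantaneous state. All sizes and the delay-recall task are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_liquid(n, density=0.1, radius=0.9):
    # Fixed random recurrent weights; they are never trained.
    W = rng.normal(0.0, 1.0, (n, n)) * (rng.random((n, n)) < density)
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))  # keep dynamics tame
    return W

def run_liquid(W, w_in, u):
    # Perturb the liquid with the input stream u, collect its states.
    x, states = np.zeros(W.shape[0]), []
    for ut in u:
        x = np.tanh(W @ x + w_in * ut)
        states.append(x.copy())
    return np.array(states)

n, T = 50, 300
W = make_liquid(n)
w_in = rng.normal(0.0, 1.0, n)
u = rng.uniform(-1.0, 1.0, T)
X = run_liquid(W, w_in, u)

# The readout is memory-less: plain linear regression on the current
# liquid state; the task here is to recall the input from 3 steps ago.
target = np.roll(u, 3)
w_out, *_ = np.linalg.lstsq(X[10:], target[10:], rcond=None)
```

Only `w_out` is learned; a different readout vector on the same liquid would serve a different task, which is the multi-function property discussed next.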
Notably, the liquid's major quality, its ability to serve as a memory store, is measured in bits. At every iteration of the generation parameter sweep (a dot in fig. 4.5(b)), a number of liquids were generated and readouts trained for the same function. The average MC distinctly peaks along the hyperbolic band. This band marks a sharp transition from the ordered dynamics (the area below) to the chaotic behavior (above). To estimate the ability to support multiple functions, multiple linear classifiers were tried on the same liquid. The mean MI shows that the critical dynamics yields a generic (readout-independent) liquid.
The experiments reproduce the earlier published theoretical and simulation results showing that linear classifiers can be successful when the liquid exhibits the critical dynamics between order and chaos. The experiments with this general-purpose ANN ASIC allow exploring the necessary connectivity and accuracy of future hardware implementations. The next planned step is to use an area of the ASIC to realize the readout. Such an LSM will be able to operate in real-time on continuous data streams.
Usability
The networks were designed with extreme scaling in mind. They support short messages (as small as 32 bytes) and HW collective operations (broadcast, reduction, barriers, interrupts, etc.).
Developing the machine at the ASIC level allowed integrating the reliability, availability and serviceability (RAS) functions into the single-chip nodes, so that the machine stays reliable and usable even at extreme scales. This feature is crucial, since the probability of the machine starting successfully approaches zero as the number of nodes grows. In contrast, clusters typically do not possess this quality at all.
The full potential cannot be unlocked without system SW, standard libraries and performance monitoring tools. Though BG/L was designed to support both the distributed-memory and the message-passing programming models efficiently, the architecture is tuned for the dominant MPI interface. From the user perspective, the BG/L appears as a network of up to 2^16 compute nodes, but this is not an architectural limit. Every 1024 nodes are assembled into a rack consuming 0.92 × 1.9 m² of floor space and 27.5 kW of power.
temperature sensors, power supplies, clock trees, etc.: more than 250 000 endpoints for a 64k machine. A 100 Mb Ethernet connects them to the host. A partition mechanism, based on link nodes, enables each user to have a dedicated set of nodes. The same mechanism also isolates any faulty nodes (once a fault is isolated, the program restarts from the last RAS checkpoint).
'utilize a very fine-grain parallelism'. It is also surprising that this cluster outperforms the BG! The personnel guess that, besides the 3× slower CPUs, 1) the Beowulf's Myrinet is better than the supercomputer's 3-D torus, and 2) the NN distribution is optimized for the cluster. The BG profiling tools show that a one-order-of-magnitude SW performance improvement is possible.
Now, I see that Maass builds a similar simulator. All this suggests that BB is just a branch of this project. Indeed, [30] confirms that BB runs the Goodman simulator. As of 2008, BB reports that the 8 kCPU BG has fulfilled the goal: a rat's cortical column has been recreated.
4.5 Conclusions
In the last decade, the attention has switched to the more realistic spiking NNs, which are theoretically much more powerful than the conventional ones. The topology is as important to the capacity of the network as its size: the optimal quality was found at the edge between 'order and chaos'. By exploiting the local connectivity along with the low network activity, in the form of event lists and the disabling of decayed dendrites, event-driven neurocomputers made of custom digital chips may deliver almost any simulation performance.
Yet, looking at the semiconductor roadmap, there is an enormous gap between the digital performance and the requirements of simulating large parts of the brain, a gap that cannot be bridged by digital VLSI. The digital computer is based on the Turing paradigm: it is inherently sequential and repeatedly executes simple operations on some kind of data stored in memory [23]. This is fundamentally opposite to the nervous system, where continuous-mode neurons process multiple tasks in parallel in real-time. The developers of reservoir computing find a flaw in the classical, stable-states-based approach. The idea is to let the inputs cause perturbations to the transient network state. In conjunction with a simple one-layer readout trained for the user's problem, this eliminates the computation-expensive learning and, by processing continuous input streams in real-time, opens the ANNs to the real-world problems. Inspired by a complex and 'analog' system, the biological nervous system, the LSM may serve as a non-Turing universal computer. The physical computer structure must match the neural model, which means it must be analog VLSI. An optimal analog computation substrate, mixed with the digital technology for weight storage and signaling, was presented.
Chapter 5
Conclusions
[3]
nature. Essential research is done by simulation. Because new qualities are exhibited in larger networks, the research is limited by the available computing power. Being highly parallel, ANNs are not limited by Amdahl's law, and there is always demand for more parallel computing. When choosing the platform, the speed, efficiency (cost, power, space), flexibility and scalability factors are considered.
For economic reasons, it is preferable to simulate networks of small and medium size on the lately considerably enhanced workstations. The first stop on the parallel simulation track is a cluster interconnected by low-latency networks. Its performance is quite competitive with the supercomputers, even before considering the highly important cost factor. But if you can afford one, the data-parallel (SIMD) architectures, the ones with one central processor, perform better in neural processing than the control-parallel (MIMD) ones, presumably due to more optimal routing, automatic data broadcast and synchronization, and the eliminated code-processing redundancy. The general-purpose computers are user-friendly and easier to migrate to new HW generations. Though the special HW cannot compete in flexibility with the parallel computers, it is faster, less expensive and more efficient.
The purpose of dedicated HW is to reach supercomputing power at the price of a top-range workstation. The neurocomputers are built to support wide ranges of popular ANN models. They can be built of general-purpose, custom digital, or analog chips, which improve both speed and efficiency by 2, 4 and 6 orders of magnitude, respectively. The analog chips are unreliable and have limited trainability. It is therefore expected that the hybrid technology, which combines the advantages of analog computation with digital storage, will dominate in the future. As yet, the general-purpose µPs have regained their popularity because of better programmability and because the very expensive custom design cannot keep up with the massively produced universal µPs of exponentially (Moore's Law) growing power.
For this reason, DSPs that offer MAC and floating-point operations used to be the most popular building blocks. Now, however, with the advent of massively parallel FPGAs, which support these DSP operations, the reconfigurable technology is taking over. Besides supporting a wider range of ANN models, the flexibility of the general-purpose components makes it possible to process the interface that surrounds the NN. The embedded devices, which tend to avoid any redundancy by simulating concrete models and networks, also benefit from the reconfigurability.
In the second part we have seen that the attention of the neurocomputer field has moved to the more realistic and powerful spiking networks. Here, the local connectivity and the low network activity are exploited. The best digital performance is shown by the event-driven machines. Real-time performance is shown by the mixed technology, which uses analog computation plus digital weight storage and signaling. We have seen how the optimal spike synchronization emerges between 'order and chaos'. Such a model runs on the BB. This project 'opens the horizons for the neuroscience researchers', as do all the others examined.
In this final section, allow me to express the personal impressions that emerged while glancing over the ANN subject. The most curious questions are the most fundamental ones: minds and machines, consciousness and computability.
Biologically, the authentic live creatures are the genes. In the battle for survival, they synthesize the protein machines (the body) and the nerves for adequate behavior in a complex and changing environment. A good brain must build a model of reality to avoid the dangerous experiences. Consciousness arises when one puts oneself inside one's model of reality1. I see here 1) a self-reference recursion akin to the one we had with the Gödel sentence when trying to answer whether a machine can think, and 2) that the mind build-up procedure implies its incompleteness, so it can easily be contradictory. This history also recalls whom the brain, this manager of the planet, serves.
An artificial implementation will have deep consequences. It does not need to grow up from a single cell and is free of the ancient rudiments; it may therefore concentrate completely on the computation, it runs at electronic rather than molecular speeds, it may have unlimited size and run unceasingly forever. Will it look at its creators as we look at animals? When Markram says, "the intelligence that is going to emerge if we succeed in doing that is going to be far more than we can even imagine", the Singularity occurs.
Personally, I have realized two things while doing this work. First, I have
realized how propaganda works. Raimond Ubar once explained to us that
information is repeated in a noise-like signal, and this repetition is how it is
separated from the background noise (as SETI does). It seems the brain is
a filter of this kind: when everybody tells you the same thing, it is recognized as
important and true. This explains Dr. Goebbels’s formulation, “Keep repeat-
ing a lie and it will become true”, and the effectiveness of the mainstream picture
of reality used by owners to manufacture consent and consumers in
their interests in democratic societies.
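The signal-averaging intuition behind that remark — a repeated pattern survives averaging while uncorrelated noise cancels — can be sketched in a few lines of Python (an illustrative toy of my own; all names and values are assumptions, not from the sources above):

```python
import random

# Toy illustration: a weak repeated signal is recovered by averaging,
# because independent noise cancels while the repeated pattern accumulates.
random.seed(0)

signal = [0.0, 1.0, 0.0, -1.0, 0.0]   # hidden pattern, repeated every transmission
copies = 1000                          # number of repetitions received

avg = [0.0] * len(signal)
for _ in range(copies):
    # Each received copy is the pattern buried in noise (noise std far above signal)
    noisy = [s + random.gauss(0.0, 2.0) for s in signal]
    avg = [a + n / copies for a, n in zip(avg, noisy)]

# Each averaged element now lies close to the true signal value:
# the residual noise std shrinks roughly as 2.0 / sqrt(copies) ≈ 0.06.
```

The same mechanism underlies both SETI-style signal extraction and, in this loose analogy, the way repetition makes a message register as ‘true’.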
Secondly, I have realized that the brain is not a hardcoded algorithm2; it
is continuously shaped by the environment. In particular, when one plays
chess, one teaches the opponent’s brain and, as a result, plays with oneself.
More generally, the task is solved not by an algorithm but by external in-
fluence. Perhaps this delusion of mine also explains the inscrutability of the mind
— there is simply no algorithm to discover, as it is constantly morphing. This
returns us to the question, important in the field of AI, of whether a machine can
simulate consciousness.
The idea that the algorithm of the mind cannot be understood can be proved in
another way. Following Turing, suppose that somebody has comprehended
his own mind. He then knows what he must do, and he can violate the algorithm
by acting in another way.
1 Dawkins, “The Selfish Gene”
2 This view was influenced by [3] and [31]
This non-determinism of the mind might come from the quantum world,
as Penrose argued. I believe that any beginner in NNs and quantum the-
ory notices the similarity in their magic ability to traverse a huge problem space
quickly. I note another one: just as quantum physics denies the trajec-
tory of a particle, computability theory fails to trace the train of thought.
The utmost diffusion of information and computation in NNs is opposed to
the unambiguity pursued by classical (mechanical, symbolic) algorithms
and data structures. Aren’t these facts about connectionism a messy
‘fuzzy logic’?
Chapter 6
Epilogue
As part of the agreement with IBM, some of Blue Gene’s time will also
be allotted to IBM’s Zurich Research Lab, working together with scien-
tists from EPFL’s Institutes of Complex Matter Physics and Nanostructure
Physics to research future (post-CMOS) semiconductor technologies such as
carbon nanotubes.1 Meanwhile, BG was designed for and is used in DNA and
protein-folding research2, which in itself offers 10 nm patterns for the new
generation of integrated circuits, replacing 65 nm photolithography.
Thom LaBean at Duke University has demonstrated the self-assembly of hun-
dreds of trillions of building blocks [32]. As a matter of fact, new transistor
technologies reshape the landscape far more substantially than any ar-
chitectural solution.
Notes
1 For instance, looking for alternative brain simulations, I encountered ad.com:
Artificial Development is a privately held company, comprised of an in-
ternational multidisciplinary team of professionals who are working to intro-
duce the world’s first true AI.
CorticalDB is building advanced artificial intelligence technologies that will
reshape business operations on a global scale. The core of these technolo-
gies is CCortex, a massive neural network with breakthrough capabilities in
simulating important aspects of human intelligence, cognition and memory.
CCortex accurately models the billions of neurons and trillions of connections
in the human brain with a layered distribution of spiking neural nets running
on a high-performance supercomputer. This Linux cluster is one of the 20
fastest computers in the world with 500 nodes, 1,000 processors, 1 terabyte of
RAM, 200 terabytes of storage, and a theoretical peak performance of 4,800
Gflops.
It achieves this simulation by dynamically employing the vast amounts of
neurological data derived from the CorticalMap and NanoAtlas projects.
1 http://www.physorg.com/news4402.html
2 http://folding.stanford.edu/FAQ-diseases.html
The CorticalMap is a comprehensive database of neurological structures that
represents multiple levels of the brain. It results from extensive neuroscience
literature data mining and contains billions of neurons with trillions of con-
nections, including data on neuron cell types, morphology, connectivity, chem-
istry, physiology and functionality. The NanoAtlas is a 100-nm resolution
digital atlas of the entire human brain that is built using innovative whole-brain
imaging and modeling techniques in the histology and genetics lab of the
company.
Bibliography
[1] Otis Port. Blue brain: Illuminating the mind. BusinessWeek Online,
June 2005. [http://www.businessweek.com/technology/content/
jun2005/tc2005066_6414_tc024.htm].
[6] M.L. Minsky and S.A. Papert. Perceptrons. M.I.T. Press, 1969.
[12] Udo Seiffert. Artificial neural networks on massively parallel com-
puter hardware. In European Symposium on Artificial Neural Networks,
pages 319–330, April 2002.
[13] X. Zhang, M. McKenna, J.P. Mesirov, and D.L. Waltz. The backpropaga-
tion algorithm on grid and hypercube architectures. Parallel Comput-
ing, 14(3):317–327, 1990.
[14] A. Zell, N. Mache, T. Sommer, and T. Korb. Recent developments of
the SNNS neural network simulator. In Proc. Applications of Neural
Networks Conference SPIE, pages 708–719, 1991.
[15] C.R. Rosenberg and G. Blelloch. An implementation of network learn-
ing on the CM. In Proc. International Joint Conference on Artificial Intelligence,
pages 329–340, Milan, Italy, 1987.
[16] D.A. Pomerleau, G.L. Gusciora, D.S. Touretzky, and H.T. Kung. Neural
network simulations at warp speed: How we got 17 million connec-
tions per second. In Proc. IEEE ICNN, pages 134–150, San Diego, Jul
1988.
[17] J.-H. Chung, H. Yoon, and S.R. Maeng. A systolic array exploiting the
inherent parallelism of artificial neural networks. Microprocessing and
Microprogramming, 33(3):145–159, 1991/92.
[18] R. Straub, D. Schwarz, and E. Schöneburg. Simulation of backpropaga-
tion networks on transputers. Neurocomputing, 2(5&6):199–208, 1991.
[19] Jihan Zhu and Peter Sutton. FPGA implementations of neural net-
works — a survey of a decade of progress. In Field-Programmable Logic and
Applications, pages 1062–1066. Springer Berlin, September 2003.
[20] Jilles Vreeken. Spiking neural networks, an introduction. Technical
report, 2002.
[21] H. Soula, A. Alwan, and G. Beslon. Learning at the edge of chaos: tem-
poral coupling of spiking neuron controllers for autonomous robotics.
In Spring Symposia on Developmental Robotics, page 6. Proceedings of the
American Association for Artificial Intelligence, 2005.
[22] Nasser Mehrtash, Dietmar Jung, Heik Hellmich, Tim Schoe-
nauer, Vi Thanh Lu, and Heinrich Klar. Synaptic plasticity in spiking
neural networks (SP²INN): A system approach. [http://mikro.ee.
tu-berlin.de/spinn/pdf/ieee03.pdf].
[23] J. Schemmel, K. Meier, and E. Mueller. A new VLSI model of neural mi-
crocircuits including spike time dependent plasticity. In IEEE Interna-
tional Joint Conference on Neural Networks, volume 3, pages 1711–1716,
July 2004.
[24] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time
computing without stable states: A new framework for neural com-
putation based on perturbations. Neural Computation, 14(11):2531–
2560, 2002.
[25] Hava Siegelmann. Neural Networks and Analog Computation: Beyond the
Turing Limit. Birkhäuser Boston, 1998.
[28] A. Gara et al. Overview of the Blue Gene/L system architecture. IBM
Journal of Research and Development, 49(2/3):195–213, Mar–May 2005.
[http://www.research.ibm.com/journal/rd/492/gara.html].