
Bioinformatics Vs AI

Diwaker Singh

Reg. Id. 10808015

B.Tech-MBA (CSE), Lovely Professional University

Punjab

Abstract- This term paper aims to provide an overview of the ways in which techniques from artificial intelligence can be usefully employed in bioinformatics, both for modelling biological data and for making new discoveries. The paper covers three techniques: symbolic machine learning approaches (nearest-neighbour and identification tree techniques); artificial neural networks; and genetic algorithms. Each technique is introduced and then supported with examples taken from the bioinformatics literature.

INTRODUCTION

This term paper deals with the main heading Bioinformatics Vs Artificial Intelligence. In it I first explain the basic definitions of the terms Bioinformatics and Artificial Intelligence. The second part contains the comparison and contrast between Bioinformatics and AI, and deals with the application of AI techniques in the field of Bioinformatics. From the same discussion, the difference between the two will become clear.

I. WHAT ARE BIOINFORMATICS AND AI?

a. BIOINFORMATICS

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

• the development and implementation of tools that enable efficient access to, and use and management of, various types of information

• the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences

b. ARTIFICIAL INTELLIGENCE

Artificial Intelligence (AI) is the area of computer science focusing on creating machines that can engage in behaviours that humans consider intelligent. The ability to create intelligent machines has intrigued humans since ancient times, and today, with the advent of the computer and 50 years of research into AI programming techniques, the dream of smart machines is becoming a reality. Researchers are creating systems which can mimic human thought, understand speech, beat the best human chess player, and perform countless other feats never before possible.

II. FUNCTIONS OF BIOINFORMATICS

In Bioinformatics we deal with storing and analysing data which can be utilised for the following aspects of Bioinformatics:

• Protein structure and function prediction
• Automated genome annotation
• Biological networks inference
• Comparative genomic analyses
• Scientific literature and textual annotation mining
• Integrative systems biology approaches
• Chemoinformatics and drug discovery applications
• Personalized medicine applications

III. WHY USE AI IN BIOINFORMATICS?

The above-mentioned tasks in Bioinformatics require a huge amount of data to be stored and analysed by experts. Done manually, this consumes an enormous amount of time and effort; even with computers, we need a good team and efficient algorithms to accomplish these tasks, and at the same time we need humans, or biologists, to draw inferences from the available data. The words "efficient" and "drawing inference" lead to the introduction of artificial intelligence in this field.

The use of artificial intelligence has thus arisen from the need of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research. The underlying motivation for many of the bioinformatics and DNA sequencing approaches is the evolution of organisms and the complexity of working with erroneous data.

We know that artificial intelligence can be used to design expert systems which require minimal human resources (even a single person can perform the task of 100 persons using such a system), can deliver the most accurate results possible, and can provide an inference mechanism to extract new information from previously gathered data.
There are several important problems where AI approaches are particularly promising:

• Prediction of protein structure
• Semiautomatic drug design
• Knowledge acquisition from genetic data

IV. FUNCTIONS OF AI IN BIOINFORMATICS

a. Data Mining
b. Bio-Medical Informatics

AI provides several powerful algorithms and techniques for solving important problems in bioinformatics and chemo-informatics. Approaches like Neural Networks, Hidden Markov Models, Bayesian Networks and Kernel Methods are ideal for areas with lots of data but very little theory. The goal in applying AI to bioinformatics and chemo-informatics is to extract useful information from the wealth of available data by building good probabilistic models.

Data Mining is an AI-powered tool that can discover useful information within a database, which can then be used to improve actions.

Bio-Medical Informatics in the field of AI combines the expertise of medical informatics in developing clinical applications with the principles that have guided bioinformatics, creating a synergy between the two areas of application.

V. AN EXAMPLE

The following example shows how AI techniques can be used in the field of Bioinformatics.

Nearest neighbour approach

We first introduce decision trees. A decision tree has the following properties: each leaf node is connected to a set of possible answers; each non-leaf node is connected to a test which splits its set of possible answers into subsets corresponding to different test results; and each branch carries a particular test result's subset to another node. To see how decision trees are useful for nearest neighbour calculations, let us consider 8 blocks of known width, height and colour. A new block then appears, of known size but unknown colour. On the basis of existing information, can we make an informed guess as to the colour of the new block?

To answer this question, we need to assume a consistency heuristic, as follows: find the most similar case, as measured by the known properties, for which the property in question is known; then guess that the unknown property is the same as that case's property. This is the basis of all nearest neighbour calculations. Although such calculations can be performed by keeping all samples in memory and computing the nearest neighbour of a new sample only when required, by comparing the new sample with the previously stored samples, there are advantages in storing the information about how to calculate nearest neighbours in the form of a decision tree, as will be seen later.

For our example problem above, we first need to calculate, for the 8 blocks of known size and colour, a decision space using width and height only (since these are the known properties of the ninth block), where each of the 8 blocks is located as a point within its own unique region of space. Once the 8 blocks have been assigned a unique region, we then calculate, for the ninth block of known width and height, a point in the same feature space. Then, depending on the colour of its 'nearest neighbour' in the region it occupies, we allocate the colour of the nearest neighbour to the ninth block.
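To make the heuristic concrete, the following minimal Python sketch computes the nearest neighbour by Euclidean distance over the two known properties. The block measurements and colours are hypothetical stand-ins for the eight blocks of Figure 1(a) (the original figures are not reproduced here), chosen only to be consistent with the midpoints and the final answer quoted in the text.

```python
import math

# Hypothetical (width, height) -> colour cases standing in for the
# eight blocks of known size and colour in Figure 1(a).
blocks = [((1, 1), "red"), ((2, 2), "blue"), ((4, 1), "green"),
          ((5, 2), "blue"), ((1, 5), "orange"), ((2, 6), "red"),
          ((5, 5), "green"), ((6, 6), "orange")]

def nearest_neighbour_colour(width, height):
    """Consistency heuristic: find the most similar known case, as
    measured by the known properties (Euclidean distance over width
    and height), and guess that the unknown property is the same."""
    _, colour = min(blocks, key=lambda b: math.hypot(b[0][0] - width,
                                                     b[0][1] - height))
    return colour

# The ninth block has known width 1 and height 4 but unknown colour.
print(nearest_neighbour_colour(1, 4))   # an informed guess: 'orange'
```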
Notice that the problem asked for an 'informed guess', not a provably correct answer. That is, in many applications it is important to attribute a property of some sort to an object when the object's real property is not known. Rather than having to leave the object out of consideration, attributing a property to the object may be useful and desirable, even if it cannot be proved that the attributed property is the real one. At least, with nearest neighbour calculations, there is some systematicity (the consistency heuristic) in the way that an unknown property is attributed.

So, for our problem above, we shall divide up the 8 blocks in advance of nearest neighbour calculation (i.e. before we calculate the nearest neighbour of the ninth block). To do this, we divide the 8 blocks by height followed by width, then height, width, and so on, until only one block remains in each set. We ensure that we divide so that an equal number of cases falls on either side. The eight blocks with known colour are first placed in the feature space, using their width and height measures as co-ordinates (Figure 1(a)).

Figure 1(a)

The ninth object (width 1, height 4) is also located in this feature space, but it may not be clear what its nearest neighbour is (i.e. which region of space it occupies); that is, the object could be orange or red. To decide which, the 8 blocks of known size are first divided into two equal subsets using, say, the height attribute. The tallest block of the shorter subset has height 2 and the shortest block of the taller subset has height 5, so the midpoint 3.5 is chosen as the dividing line.

Figure 1(b)

The next step is to use the second attribute, width, to separate each of the two subsets from Figure 1(b) into further subsets (Figure 1(c)). For the shorter subset, the wider of the two narrow blocks is 2 and the narrower of the two wide blocks is 4; the midpoint is therefore 3. For the taller subset, a similar form of reasoning leads to a midpoint of 3.5. Note that the two subsets have different midpoints.

Figure 1(c)
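The dividing lines just described can be computed mechanically: sort the cases on the chosen attribute, split them into two equal halves, and take the midpoint between the closest values on either side of the split. A small sketch, using hypothetical heights consistent with the values quoted above:

```python
def split_midpoint(values):
    """Midpoint between the largest value in the lower half and the
    smallest value in the upper half of an equally divided, sorted list."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (ordered[half - 1] + ordered[half]) / 2

# Hypothetical heights for the 8 blocks: the tallest of the shorter
# subset is 2 and the shortest of the taller subset is 5, so the
# dividing line falls at 3.5, as in Figure 1(b).
print(split_midpoint([1, 2, 1, 2, 5, 6, 5, 6]))   # 3.5
```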
Since each block does not yet occupy its own region of space, we return to height (since there are only two known attributes for each object) and split each of the four subsets once more (Figure 1(d)).

Figure 1(d)

For the two taller subsets, the midpoints are both, coincidentally, 5.5 (between 5 and 6). For the two shorter subsets, the midpoints are both, coincidentally, 1.5. Each block now has its own region of space. Once we have divided up the cases, we can generate a decision tree, using the midpoints discovered as test nodes (Figure 1(e)), in the order in which they were found. Once we have the tree, we can trace a path down it for the ninth block, following the appropriate branches depending on the outcome of each test, and allocate a colour to this block (orange).

Figure 1(e)

While this conclusion might have been reached through a visual inspection of the feature space alone (Figure 1(a)), in most cases we will be dealing with a multi-dimensional feature space, so a systematic method for calculating nearest neighbours is required which takes into account many more than two dimensions.

Also, nearest neighbour techniques provide only approximations as to what the missing values may be. Any data entered into a database system, or any inferences drawn from data obtained by nearest neighbour calculations, should always be flagged to ensure that these approximations are not taken to have the same status as 'facts'. New information may come in later which requires these approximations to be deleted and replaced with real data.

We could of course leave calculating the nearest neighbour of a new sample until the new sample enters the system. But if there are thousands of samples and hundreds of attributes, calculating the nearest neighbour each time a new sample arrives with unknown properties may prove to be a bottleneck for database entry. Nearest neighbour calculations can be used to infer the missing values of database attributes, for instance, and if we are dealing with real-time databases, the overhead of calculating the nearest neighbour of each new record with missing values could be a real problem. Instead, it is more efficient to compute decision trees in advance for attributes which are known to be 'noisy', so that entering a new record is delayed only by the time it takes to traverse the relevant decision tree (as opposed to calculating the nearest neighbour from scratch). Also, once the decision tree is computed, the information as to why a new sample is categorised with a particular nearest neighbour is readily available in the tree.
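The following sketch shows how such a tree of midpoint tests might be built and traversed, alternating between the height and width attributes as in Figures 1(b) to 1(e). The blocks are the same hypothetical values used earlier; with them, the recursive splits reproduce the midpoints quoted in the text (3.5; then 3 and 3.5; then 5.5 and 1.5) and the ninth block is allocated the colour orange.

```python
def build_tree(blocks, attr=1):
    """Recursively split ((width, height), colour) cases into two equal
    halves on alternating attributes (1 = height, 0 = width), storing
    the midpoint between the halves as the test at each non-leaf node."""
    if len(blocks) == 1:
        return blocks[0][1]                    # leaf: a single block's colour
    ordered = sorted(blocks, key=lambda b: b[0][attr])
    half = len(ordered) // 2
    midpoint = (ordered[half - 1][0][attr] + ordered[half][0][attr]) / 2
    return (attr, midpoint,
            build_tree(ordered[:half], 1 - attr),
            build_tree(ordered[half:], 1 - attr))

def classify(tree, point):
    """Trace a path down the tree, branching on the outcome of each test."""
    while isinstance(tree, tuple):
        attr, midpoint, below, above = tree
        tree = below if point[attr] <= midpoint else above
    return tree

blocks = [((1, 1), "red"), ((2, 2), "blue"), ((4, 1), "green"),
          ((5, 2), "blue"), ((1, 5), "orange"), ((2, 6), "red"),
          ((5, 5), "green"), ((6, 6), "orange")]
print(classify(build_tree(blocks), (1, 4)))    # ninth block -> 'orange'
```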
Analysing such trees can shed new light on the domain in question. A decision tree therefore represents knowledge about nearest neighbours.

Nearest neighbour example in bioinformatics

Typically, bioinformatics researchers want to find the most similar biosequence to another biosequence, and such biosequences can contain hundreds and possibly thousands of 'attributes', i.e. positions, which are candidates for helping to identify similarities between biosequences. There will typically be many more attributes than sequences, so the choice of specific attributes to use as tests would be completely arbitrary and random. For this reason, nearest neighbour calculations usually take into account all the information in all positions before attributing a missing value.

VI. ANOTHER EXAMPLE

Neural Networks in Bioinformatics

Artificial Neural Networks (ANNs) were originally conceived in the 1950s and are computational models of human brain function. They are made up of layers of processing units (akin to neurons in the human brain) and connections between them, collectively known as weights. For the sake of exposition, only the most basic neural network architecture, consisting of just an input layer of neurons and an output layer of neurons, will be considered first. ANNs are simulated by software packages which can be run on an average PC. A processing unit is based on the neuron in the human brain and is analogous to a switch.

Put simply, the unit receives incoming activation, either from a dataset or from a previous layer, and makes a decision whether to propagate that activation. Units contain an activation function which performs this calculation; the simplest of these is the step, or threshold, function (Figure 2).

Figure 2

However, the most commonly used activation function is the sigmoid function (Figure 3). This is a strictly increasing function which exhibits smoothness and asymptotic properties, and it is differentiable. The use of such sigmoid activation functions in multilayer perceptron networks with backpropagation contributes to stability in neural network learning.

Figure 3
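As a small illustration, both activation functions can be written in a few lines of Python; the threshold value used here is an arbitrary assumption.

```python
import math

def step(activation, threshold=0.0):
    """Step (threshold) function: propagate the signal fully or not at all."""
    return 1.0 if activation >= threshold else 0.0

def sigmoid(activation):
    """Sigmoid function: strictly increasing, smooth, asymptotic to 0 and 1,
    and differentiable -- the property that backpropagation relies on."""
    return 1.0 / (1.0 + math.exp(-activation))

print(step(0.2), sigmoid(0.2))   # 1.0 0.549...
```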
Units are connected by weights which propagate signals from one unit to the next (usually from layer to layer). These connections are variable in nature and determine the strength of the activation which is passed from one unit to the next. The modification of these weights is the primary way in which the neural network learns the data. In supervised learning this is accomplished by a method known as backpropagation, where the output from the neural network is compared with the desired output from the data, and the resulting error is used to change the weights so as to minimise that error.

The task of the ANN is to modify the weights between the input nodes and the output node (or output nodes, if there is more than one in the output layer) through repeated presentation of the samples with their desired outputs. The process through which this happens is, first, feedforward, whereby the input values of a sample are multiplied by the initially random weights connecting the input nodes to the output node; second, comparison of the output node value with the desired (target) class value (typically 0 or 1) of that sample; and third, back-propagation of an error adjustment to the weights, so that the next time the sample is presented, the actual output is closer to the desired output. This is repeated for all samples in the data set and results in one epoch (the presentation of all samples once). The process is then repeated for a second epoch, a third epoch, and so on, until the feed-forward back-propagating (FFBP) ANN manages to reduce the output error for all samples to an acceptably low value. At that point, training is stopped and, if needed, a test phase can begin, whereby samples not previously seen by the trained ANN are fed into the ANN, the weights are 'clamped' (i.e. no further adjustment can be made to them), and the output node value for each unseen sample is observed. If the class of the unseen samples is known, then the output node values can be compared with the known class values to determine the validity of the network. Network validity can be measured in terms of how many unseen samples were falsely classified as positive by the trained ANN when they are in fact negative (the 'false positive' rate) and vice versa (the 'false negative' rate). If the class of an unseen sample is not known, then the output node values make a prediction as to the class into which the sample falls. Such predictions may need to be tested empirically.

More formally and very generally, the training phase of an ANN starts by allocating random weights w_1, w_2, ..., w_n to the connections between the n input units and the output unit. Second, we feed the first pattern p of bits x_1(p), x_2(p), ..., x_n(p) into the network and compute an activation value for the output unit:

O(p) = f( x_1(p)w_1(p) + x_2(p)w_2(p) + ... + x_n(p)w_n(p) )

where f is the unit's activation function. That is, each input value is multiplied by the weight connecting its input node to the output node, and all the weighted values are then summed to give a value for the output node. Third, we compare the output value for the pattern with the desired output value and update each weight prior to the input of the next pattern p':

w_i(p') = w_i(p) + Δw_i(p)

where Δw_i(p) is the weight correction for pattern p, calculated as Δw_i(p) = x_i(p) × e(p), with e(p) = O_D(p) − O(p), where in turn O_D(p) is the desired output for the pattern and O(p) is the actual output. This is carried out for every pattern in the dataset (usually with shuffled, or random, ordering). At that point we have one epoch. The process is then repeated from the second step above for a second epoch, a third, and so on.
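These update equations translate almost line for line into code. The sketch below assumes a single output unit with a step activation and a toy pattern set (logical AND); a small learning rate is also added for stability, whereas the equation above corresponds to a rate of 1.

```python
import random

def step(activation, threshold=0.5):
    return 1.0 if activation >= threshold else 0.0

# Hypothetical patterns: each is (x_1(p), ..., x_n(p)) with desired
# output O_D(p); here the network learns logical AND.
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

n, rate = 2, 0.2
weights = [random.uniform(-0.5, 0.5) for _ in range(n)]  # random w_1 ... w_n

for epoch in range(1000):
    random.shuffle(patterns)             # shuffled ordering, as in the text
    sse = 0.0
    for x, desired in patterns:
        # Feedforward: O(p) = f(sum of x_i(p) * w_i(p))
        output = step(sum(xi * wi for xi, wi in zip(x, weights)))
        error = desired - output         # e(p) = O_D(p) - O(p)
        sse += error ** 2
        for i in range(n):               # w_i(p') = w_i(p) + x_i(p) * e(p)
            weights[i] += rate * x[i] * error
    if sse <= 0.001:                     # converged: SSE sufficiently small
        break

print(epoch, weights)
```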
Typically, an ANN is said to have converged, or learned, when the sum of squared errors (SSE) on the output nodes for all patterns in one epoch is sufficiently small (typically, 0.001 or below). The equations above constitute the delta learning rule, which can be used to train single-layer networks. A slightly more complex set of equations, making use of the sigmoid function described earlier, exists for learning in ANNs with more than one layer.

While many different types of neural network exist, they can generally be distinguished by the type of learning involved. There are two basic types of learning: supervised and unsupervised. In supervised learning, the required behaviour of the neural network is known (as described above). For instance, the input data might be the share prices of 30 companies, and the output might be the value of the FTSE 100 index. With this type of problem, past information about the companies' share prices and the FTSE 100 can be used to train the network. New prices can then be given to the neural network and the FTSE 100 predicted. With unsupervised learning, the required output is not known, and the neural network must make some decisions about the data without being explicitly trained; generally, unsupervised ANNs are used for finding interesting clusters within the data, and all the decisions made about features within the data are found by the neural network itself. Figure 4 illustrates the architecture of a two-layer supervised ANN consisting of an input layer, a 'hidden' layer and an output layer.

Figure 4

Unsupervised neural networks have been used frequently in bioinformatics, as they are a well-tested method for clustering. The most common technique used is the self-organising feature map (SOM or SOFM), a learning algorithm whose units and layers are arranged in a different manner from the feedforward backpropagation networks described above. The units are arranged in a matrix formation known as a map, and every input unit is connected to every unit in this map. These map units then form the output of the neural network (Figure 5).

Figure 5

The SOFM relies on the notion that similar individuals in the input data will possess similar feature characteristics. The weights are trained so as to group together those individual records which possess similar features. It is this automated clustering behaviour which is of interest to bioinformatics researchers.
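A minimal, illustrative SOFM sketch along these lines is given below: every input record is compared with every map unit's weight vector, and the best-matching unit, together with its map neighbours, is nudged towards the input, so that similar records come to share nearby map units. The grid size, learning rate, neighbourhood radius and toy 'expression profiles' are all assumptions made for the purpose of the example.

```python
import random

def dist2(w, x):
    """Squared Euclidean distance between a weight vector and an input."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

def train_som(data, rows=4, cols=4, epochs=50, rate=0.3, radius=1):
    dim = len(data[0])
    # One weight vector per map unit: every input unit feeds every map unit.
    weights = {(r, c): [random.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the map unit most similar to this input.
            bmu = min(weights, key=lambda u: dist2(weights[u], x))
            for (r, c), w in weights.items():
                # Nudge the BMU and its map neighbours towards the input.
                if abs(r - bmu[0]) <= radius and abs(c - bmu[1]) <= radius:
                    for i in range(dim):
                        w[i] += rate * (x[i] - w[i])
    return weights

# Toy 'expression profiles' forming two crude groups.
profiles = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.9, 0.8, 0.9], [0.8, 0.9, 0.8]]
som = train_som(profiles)
for p in profiles:   # similar profiles should land on nearby map units
    print(min(som, key=lambda u: dist2(som[u], p)))
```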
Numerous advantages of ANNs have been identified in the AI literature. Neural networks can perform with better accuracy than equivalent symbolic techniques (for instance, decision trees) on the same data. Also, while identification tree approaches such as See5 can identify dominant factors, whose importance can be represented by their positions high up in the tree, ANNs may be able to detect non-dominant relationships (i.e. relationships involving several factors, each of which by itself may not be dominant) among the attributes. However, there are also a number of disadvantages. Data often has to be pre-processed to conform to the requirements of the input nodes of ANNs (e.g. normalised and converted into binary form). Training times can be long in comparison with symbolic techniques. Finally, and perhaps most importantly, solutions are encoded in the weights and are therefore not as immediately obvious as the rules and trees produced by symbolic approaches.

Leukaemia Dataset

The leukaemia dataset consists of 72 individuals, with 7129 gene expression values collected from each of them. Classifications consist of individuals suffering from one of two types of leukaemia, ALL or AML. Distinguishing between these two types is very important, as they respond very differently to different types of therapy, and the survival rate may well be improved by the correct classification of the leukaemia. The outward symptoms are very similar, so a method which can differentiate between the two based on gene expression profiles could help patients with the disease. One approach uses several neural networks, trained on different aspects of the data, in conjunction to give a final classification. It uses the original data and two Fourier transforms of the data to train three separate "experts"; a gating network then uses a majority vote to determine the final classification from these networks. This approach yields a 90% classification rate (therefore only 10% error) from the final network, in contrast to 75% for each individual expert.

Another approach has been to run many different feature selection algorithms and a variety of neural networks on this leukaemia dataset. Feature selection is a method of reducing the number of features, or attributes, within the dataset; it reduces the time taken to train any algorithm, but can be especially important for neural networks. A standard neural network with three layers and 5-15 hidden units performed very well, achieving an error rate of just 2.8% on the test dataset when coupled with Pearson rank correlation feature selection.
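A sketch of this kind of correlation-based feature ranking is given below. The expression matrix and class labels are hypothetical placeholders rather than the actual 72 × 7129 leukaemia data, with 0 and 1 standing for the ALL and AML classes; note also that the published approach used a rank correlation, whereas this sketch uses the plain Pearson coefficient for brevity.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(expression, labels, k):
    """Rank genes by the absolute correlation of their expression values
    with the class labels and keep the indices of the top k genes."""
    scores = [(abs(pearson(gene, labels)), i)
              for i, gene in enumerate(expression)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# Hypothetical toy data: 4 genes measured across 6 individuals.
expression = [
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],   # tracks the class labels closely
    [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],   # uninformative
    [0.9, 0.8, 0.9, 0.1, 0.2, 0.1],   # anti-correlated, still informative
    [0.3, 0.9, 0.2, 0.7, 0.1, 0.8],   # noisy
]
labels = [0, 0, 0, 1, 1, 1]           # e.g. 0 = ALL, 1 = AML
print(select_features(expression, labels, 2))   # the two correlated genes
```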
It is obvious from these approaches that neural methods are capable of classifying this data with very good accuracy. The problem with using neural networks in this manner is that, whilst the network itself can be used as a method for predicting classifications, the attributes which have been used to make the classifications cannot be easily determined. That is, the model can be used to predict the class, but we cannot easily find the genes which are responsible for that classification. This is especially the case where networks with hidden layers have been used (as in the previous experiments), because the hidden layer can find non-linear relationships between combinations of attributes and classes. Such non-linear relationships cannot be easily described in the rule format shown earlier.

These were only a few examples; there are many more which prove the importance of AI in Bioinformatics.

VII. WORKING EXAMPLE
Functional Genomics and the Robot Scientist:

• Robot scientist developed by University of Wales researchers
• Designed for the study of functional genomics
• Tested on yeast metabolic pathways
• Utilizes logical and associational knowledge representation schemes

Figure 6.a: The Robot Scientist and yeast metabolic pathways

Figure 6.b

CONCLUSION

In a nutshell, I want to conclude from this term paper that, though Bioinformatics is a totally different field, it depends on Artificial Intelligence for speed, accuracy, inference and prediction. Artificial Intelligence aims at providing tools and techniques which can help Bioinformatics accomplish its tasks with convenience.

ACKNOWLEDGMENT

I, Diwaker Singh, a student of B.Tech-MBA (CSE), would like to thank my teacher for the corresponding subject, Mr. Vijay Kumar Garg, for his guidance during the preparation of this term paper. This term paper has surely enhanced my knowledge. I would also like to thank my friends who helped me in gathering information about the topic.
