
CIS427: Introduction to Bioinformatics

The application of computer science in Bioinformatics


Dr V. C. Osamor Department of Computer & Information Science (Bioinformatics unit) Covenant University vcosamor@yahoo.com
Dr V.C. Osamor CIS427

Objectives
- To expose students to areas of computing techniques applicable in bioinformatics.
- To expose students to why prediction is important in solving biological problems.
- To expose students to how machine learning is applied in bioinformatics.
- To expose students to how expert systems are applied in bioinformatics.

The prediction Problem


Take any sentence in the English language.
We know that letters do not occur in English at random (e.g. e is more common than z).
Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression).
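
The claim that letters do not occur at random can be checked by counting. The following is a minimal sketch; the function name `letter_frequencies` and the sample sentence are chosen here for illustration only.

```python
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each letter in text, ignoring case."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.items()}


# Illustrative sample; any English text can be used instead.
sample = "We know that letters do not occur in English at random"
freqs = letter_frequencies(sample)

# Show the five most frequent letters in this sample.
for letter, freq in sorted(freqs.items(), key=lambda item: -item[1])[:5]:
    print(f"{letter}: {freq:.3f}")
```

A predictor can use such frequencies to guess the most likely next symbol, which is the idea behind compression.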


Prediction in bioinformatics
This will largely require developing your own computational tools, or using existing tools, in areas such as:
- Predicting the location of genes in DNA
- Predicting gene roles in an organism
- Predicting errors in genetic transcription
- Predicting the function of proteins
- Predicting diseases from molecular samples
- Anything that involves making a judgment: a yes/no decision about whether some sample datum does or does not have some property.

Representation
DATA - DATA

0101011101100101011001010111010000101101
To the computer, everything is binary!


0101011101100101011001010111010000101101
0101101100100111111011010011010000101101
A AC GT CA T T CGA T GAT T CGA
Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, like a genetic sequence.
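
One way to see a genetic sequence as binary data is to give each nucleotide a fixed bit pattern. The sketch below assumes a 2-bit code (A=00, C=01, G=10, T=11); this assignment is an illustration, not the encoding used on the slide.

```python
# Assumed 2-bit code for illustration; any fixed assignment would work.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def dna_to_bits(dna):
    """Encode a DNA string as a string of 0s and 1s, two bits per base."""
    return "".join(CODE[base] for base in dna.upper())


def bits_to_dna(bits):
    """Decode a bit string produced by dna_to_bits back into DNA letters."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


bits = dna_to_bits("ttgcaatcgg")
print(bits)
print(bits_to_dna(bits))
```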


A genetic prediction problem


ttgcaatcggcgctacgcttcaaaatttattatattcccggc
gcggctacgttcatcccagcagcagcgattttaaaattaa
cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcg
cagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagc


A genetic prediction problem

A gene encodes a protein.

It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism.

A genetic prediction problem

[Figure labels: untranslated region, encoding region, transcription factor, RNA]


A genetic prediction problem

Untranslated region: ttgcaatcggcgctacgcttcaaaatttattatattcccggc


A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc

What transcription factors bind to this gene?
Where is the transcription factor binding site?

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: A binding site is often a short, general pattern.
E.g. CCGATNATCGG (N stands for any nucleotide)
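
Searching a sequence for such a pattern can be sketched with a regular expression in which N matches any of the four bases. The helper names `motif_to_regex` and `find_sites`, and the test sequence containing a planted site, are assumptions made for this illustration.

```python
import re


def motif_to_regex(motif):
    """Translate a motif into a regular expression; N matches any base."""
    return "".join("[ACGT]" if ch == "N" else ch for ch in motif.upper())


def find_sites(dna, motif):
    """Return the 0-based start of every match, overlapping ones included."""
    # The lookahead (?=...) lets matches overlap each other.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(dna.upper())]


# A made-up sequence with one planted site, for illustration only.
print(find_sites("aaCCGATGATCGGtt", "CCGATNATCGG"))
# The untranslated region from the slide does not contain this example motif.
print(find_sites("ttgcaatcggcgctacgcttcaaaatttattatattcccggc", "CCGATNATCGG"))
```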


A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: The patterns are often their own reverse complements, i.e. the complementary strand read in the opposite direction gives the same pattern.
E.g.
CCGATNATCGG
GGCTANTAGCC
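
The relationship between the two strands in this example can be checked in a few lines. This is a minimal sketch; the function names are chosen here for illustration.

```python
# Base pairing rules; N (any base) pairs with N.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def complement(dna):
    """Return the complementary strand, written in the same direction."""
    return "".join(COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna):
    """Return the complementary strand, read in the opposite direction."""
    return complement(dna)[::-1]


def is_reverse_palindrome(dna):
    """True if a pattern equals its own reverse complement."""
    return reverse_complement(dna) == dna.upper()


print(complement("CCGATNATCGG"))
print(reverse_complement("CCGATNATCGG"))
print(is_reverse_palindrome("CCGATNATCGG"))
```

The second line of the slide example is the complement of the first; reversing it gives back the original pattern.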

A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: Where there is one binding site, often there is another nearby.


A genetic prediction problem - ALGORITHMS and DATA STRUCTURES


All of these properties are things that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.


Proteomics
Three consecutive nucleotides in the coding region form a codon, i.e. they encode an amino acid.
A string of amino acids makes a protein.

3 nucleotides, 4 possibilities each: A, C, T, G

4^3 = 64 possible codons
But there are only 20 amino acids!
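
The count of 64 can be confirmed by enumerating every three-letter combination of the four nucleotides. A minimal sketch:

```python
from itertools import product

# Every ordered choice of 3 letters from the 4 nucleotides.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]

print(len(codons))   # prints 64
print(codons[:4])    # the first few codons in alphabetical order
```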


Proteomics
There is quite a bit of redundancy in codons.

Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG
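
The redundancy can be represented as a lookup table in which several codons map to the same amino acid. The sketch below contains only the three amino acids listed above, so it is a partial table; the names `CODON_TO_AMINO_ACID` and `translate` are assumptions for this illustration, and unknown codons are reported as "?".

```python
# Partial table built only from the examples on the slide.
CODONS_BY_AMINO_ACID = {
    "Glycine": ["GGA", "GGC", "GGG", "GGT"],
    "Tyrosine": ["TAT", "TAC"],
    "Methionine": ["ATG"],
}

# Invert the table: many codons point to one amino acid.
CODON_TO_AMINO_ACID = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons
}


def translate(dna):
    """Read dna three letters at a time and look up each codon."""
    dna = dna.upper()
    return [
        CODON_TO_AMINO_ACID.get(dna[i:i + 3], "?")
        for i in range(0, len(dna) - 2, 3)
    ]


print(translate("ATGGGATATGGC"))
```

Here GGA and GGC are different codons, yet both are reported as glycine.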


Amino Acid
[Figure: amino acid structure, labelled R group, amino group, carboxyl group]

Amino Acid

[Figure: tyrosine and glycine]


Artificial Intelligence
Computers do things that otherwise only human brains can do.
[Figure labels: expert system, expert]


Artificial Intelligence & Machine Learning


Computers do things that otherwise only human brains can do.
[Figure labels: expert system, learning system]


Machine learning
What is machine learning?
- creating computer programs that get better with experience
- learning how to make expert judgments
- discovering previously hidden, potentially useful information (data mining)

How does it work?
- the user provides the learning system with examples of the concept to be learned
- an induction algorithm infers a characteristic model of the examples
- the model is used to predict whether or not future novel instances are also examples
- and it does this very consistently, and very, very quickly!
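
The examples, induction, model, prediction cycle can be sketched with a deliberately tiny learner. Everything below is an assumption made for illustration: the labelled sequences are made-up toy data, and the "model" is just a threshold on GC content placed halfway between the two class averages.

```python
def gc_content(dna):
    """Fraction of bases in dna that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def train(examples):
    """Induce a model from (sequence, label) pairs with boolean labels.

    The model is a GC-content threshold halfway between the class means.
    """
    positives = [gc_content(seq) for seq, label in examples if label]
    negatives = [gc_content(seq) for seq, label in examples if not label]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return {
        "threshold": (mean_pos + mean_neg) / 2,
        "positive_is_high": mean_pos > mean_neg,
    }


def predict(model, dna):
    """Predict whether a novel sequence is an example of the concept."""
    is_high = gc_content(dna) > model["threshold"]
    return is_high == model["positive_is_high"]


# Made-up toy examples: True marks sequences that have the property.
examples = [
    ("GCGCGGCC", True),
    ("GGCCGCAT", True),
    ("ATATTTAA", False),
    ("TTAAATGC", False),
]
model = train(examples)
print(model["threshold"])
print(predict(model, "GCGGCCGA"), predict(model, "ATTTAATA"))
```

Real learning systems use richer features and models, but the division into training and prediction is the same.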

Statistical and Mathematical techniques


Statistical and mathematical techniques range from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
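
As a small illustration of Markov chain Monte Carlo for a Bayesian problem, the sketch below uses a Metropolis sampler to estimate the probability p that a base is G or C, given the counts in a sequence and a uniform prior on p. The model, the step size, the number of steps and all function names are assumptions chosen for this example.

```python
import math
import random


def log_posterior(p, k, n):
    """Log posterior (up to a constant) of p after k G/C bases out of n."""
    if not 0.0 < p < 1.0:
        return float("-inf")  # outside the support of the uniform prior
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, steps=20000, burn_in=2000, step_size=0.1, seed=0):
    """Draw samples of p with a Gaussian random-walk Metropolis sampler."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for i in range(steps):
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio);
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            p, current = proposal, candidate
        if i >= burn_in:
            samples.append(p)
    return samples


sequence = "ttgcaatcggcgctacgcttcaaaatttattatattcccggc".upper()
n = len(sequence)
k = sequence.count("G") + sequence.count("C")
samples = metropolis(k, n)
print(sum(samples) / len(samples))  # sampled estimate of the posterior mean
print((k + 1) / (n + 2))            # exact posterior mean, for comparison
```

For this simple model the posterior is known exactly, which makes it a convenient check on the sampler; MCMC matters in practice for models where no such closed form exists.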

OTHER BIOINFORMATICS APPLICATIONS


Bioinformatics Applications
At the beginning of the "genomic revolution", bioinformatics was applied in the creation and maintenance of databases to store biological information, such as nucleotide and amino acid sequences. Developing this type of database involved not only design issues but also the development of complex interfaces through which researchers could both access existing data and submit new or revised data.

Biotechnology
- Biologists know proteins; computer scientists know machine learning
- Together, they can find out a lot of hidden information about genes and proteins
- Biotechnology is a multi-billion dollar industry
- Biotechnology is one of the best-funded areas of scientific research

Modelling a Biological system


There are two fundamental ways of modelling a biological system (e.g. a living cell), both of which come under bioinformatics approaches.
Static
- Sequences: proteins, nucleic acids and peptides
- Structures: proteins, nucleic acids, ligands (including metabolites and drugs) and peptides
- Interaction data among the above entities, including microarray data and networks of proteins and metabolites

Modelling Biological System


Dynamic
- Systems biology comes under this category, including reaction fluxes and variable concentrations of metabolites
- Multi-agent based modelling approaches capturing cellular events such as signalling, transcription and reaction dynamics


Major Research Efforts


Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution.

Major research areas


1 Sequence analysis
2 Genome annotation
3 Computational evolutionary biology
4 Analysis of gene expression
5 Analysis of regulation
6 Analysis of protein expression
7 Analysis of mutations in cancer
8 Comparative genomics
9 Modeling biological systems
10 High-throughput image analysis
11 Structural bioinformatics approaches
- Prediction of protein structure, molecular interaction, docking algorithms

Sequence Analysis
Sequence information is analyzed to determine genes that encode polypeptides (proteins), RNA genes, regulatory sequences, structural motifs, and repetitive sequences.
A comparison of genes within a species or between different species can show similarities between species (the use of molecular systematics to construct phylogenetic trees).
With the growing amount of data, it became impractical to analyze DNA sequences manually. Today, computer programs such as BLAST are used.
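
Comparing two sequences by computer usually means aligning them. The sketch below computes a global alignment score by dynamic programming (the Needleman-Wunsch recurrence); it is not how BLAST works, which relies on a faster heuristic local search. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for this example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(alignment_score("GATTACA", "GATACA"))   # one base deleted
```

The running time grows with the product of the two sequence lengths, which is one reason heuristic tools are used for database-scale searches.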

Genome Annotation/Gene finding


Annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team at The Institute for Genomic Research that sequenced and analyzed a free-living organism, the bacterium Haemophilus influenzae. He built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes.
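
A very simplified form of gene finding is to look for open reading frames: stretches that begin with the start codon ATG and run, codon by codon, to a stop codon. The sketch below scans the forward strand only and is an illustration under that assumption, not the method of the system described above; the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of open reading frames in dna.

    start is the index of the A in ATG; end is one past the stop codon.
    Only the forward strand is scanned; nested start codons are reported
    as separate, overlapping frames.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk forward in steps of three until a stop codon appears.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


# Made-up sequence: ATG AAA TAG embedded between two short flanks.
print(find_orfs("CCATGAAATAGCC"))
```

Practical annotation systems also examine the reverse strand and use statistical models of coding sequence.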

Computational Evolutionary Biology


Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
- Trace evolution
- Compare genomes
- Build computational models that predict populations over time

Comparative Genomics
The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms.


Modelling of biological systems


Modelling in Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes.
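
A computer simulation of a cellular subsystem can be as small as a single reaction. The sketch below assumes a substrate S converted to a product P at a rate proportional to the amount of S, and steps the concentrations forward in time with the Euler method. The reaction, the rate constant and the starting concentration are made-up values for illustration.

```python
def simulate(s0, k, t_end, dt=0.01):
    """Simulate S -> P with rate k * S using fixed Euler time steps.

    Returns a list of (time, substrate, product) tuples.
    """
    steps = int(round(t_end / dt))
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for step in range(1, steps + 1):
        converted = k * s * dt  # amount of S turned into P in this step
        s -= converted
        p += converted
        trajectory.append((step * dt, s, p))
    return trajectory


# Made-up parameters: 10 units of substrate, rate constant 0.5 per time unit.
trajectory = simulate(s0=10.0, k=0.5, t_end=2.0)
for time, substrate, product in trajectory[::50]:
    print(f"t={time:.1f}  S={substrate:.3f}  P={product:.3f}")
```

Models of real pathways couple many such equations, one per metabolite, and typically use more accurate numerical solvers.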

High throughput Imaging


Image analysis systems augment an observer's ability to make measurements from a complex set of images by improving accuracy, objectivity, or speed. Biomedical imaging is becoming more important for both diagnostics and research.
