This review paper would not have been possible without the help and encouragement
of many individuals. It gives me tremendous pleasure to acknowledge the valuable
assistance extended to me by various people in the successful completion of this program.
First and foremost, I wish to express my profound thanks to our respected teachers Prof.
Bikash Ranjan Pati, Head of the Dept. of Microbiology, Vidyasagar University, and Dr. Keshab
Ch. Mondal and Dr. Debdulal Banerjee, Lecturers, Dept. of Microbiology, Vidyasagar University,
for their constant encouragement and valuable guidance and suggestions.
I also express my thanks to Mr. Pradeep DasMahapatra and Mr. Jhereswer
Chalak, Laboratory Assistants of the Dept. of Microbiology, Vidyasagar University.
I would like to express my special thanks to Mr. Paltu Dhal, Scholar, IIT Kharagpur,
for his valuable advice and support.
I must express my deepest regard to my parents for their continuous blessings and
support.
Last but not least, I would like to thank all of my friends and my department, who
gave me timely support, thanks to which the seminar report has been completed successfully.
..............................................................
Subject
Abstract
Introduction
Basic Terminology in Phylogenetic Tree
The Concept of Evolutionary Trees
Fundamental Elements of Phylogenetic Models
Tree Interpretation: the Importance of Identifying Paralogs and Orthologs
Genome Complexity and Phylogenetic Analysis
Phylogenetic Data Analysis
Relationship of Phylogenetic Analysis to Sequence Alignment
Tree-Building Methods
Outline of Tree-Building Method
Character-Based Methods
Maximum Parsimony Method
Maximum Likelihood Method
Distance-Based Methods
Unweighted Pair Group Method With Arithmetic Mean (UPGMA)
Neighbor Joining (NJ)
Fitch-Margoliash (FM)
Minimum Evolution (ME)
Which Distance-Based Tree-Building Procedure Is Best?
The Difference between Distance, Parsimony, and Maximum Likelihood Methods
Pitfalls of Phylogenetic Analysis
Tree Evaluation
Reliability of Phylogenetic Predictions
Complications from Phylogenetic Analysis
Reference
Phylogenetic analysis- a bioinformatics tool
Abstract:
Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is
the means of inferring or estimating these relationships. The evolutionary history inferred
from phylogenetic analysis is usually depicted as branching, treelike diagrams that
represent an estimated pedigree of the inherited relationships among molecules,
organisms, or both. The phylogenetic relationship between two or more sets of
sequences is often extremely important information for bioinformatics analyses such as
the construction of sequence alignments. It is not unreasonable to think of
molecular data as something of a historical document that contains within it evidence
of the important steps in the evolution of a gene. The very same evolutionary events
(substitutions, insertions, deletions, and rearrangements) that are important to the
history of a gene can also be used to resolve questions about the evolutionary history
of, and relationships between, entire species. In fact, the phylogenetic relationships among
many kinds of organisms are difficult to determine in any other way. The true
relationship between homologous sequences is hardly ever known outside of
computer simulation experiments. A variety of approaches are available for inferring the
most likely phylogenetic relationship between genes and species using nucleotide and
protein sequence information. Bioinformatics thus provides statistically based,
commonly used methods for inferring evolutionary relationships.
Introduction:
Phylogenetic analysis of nucleic acid and protein sequences is presently and will
continue to be an important area of sequence analysis. In addition to analyzing changes
that have occurred in the evolution of different organisms, the evolution of a family of
sequences may be studied. On the basis of the analysis, sequences that are the most
closely related can be identified by their occupying neighboring branches on a tree.
When a gene family is found in an organism or group of organisms, phylogenetic
relationships among the genes can help to predict which ones might have an equivalent
function. These functional predictions can then be tested by genetic experiments.
Phylogenetic analysis may also be used to follow the changes occurring in a rapidly
changing species, such as a virus. Analysis of the types of changes within a population
can reveal, for example, whether or not a particular gene is under selection (McDonald
and Kreitman 1991; Comeron and Kreitman 1998; Nielsen and Yang 1998), an important
source of information in applications like epidemiology.
Molecular phylogenetics predates DNA sequencing by several decades. It is
derived from the traditional method for classifying organisms according to their
similarities and differences, as first practiced in a comprehensive fashion by Linnaeus in
the 18th century. Linnaeus was a systematist, not an evolutionist; his objective was to
place all known organisms into a logical classification that he believed would reveal
the great plan used by the Creator, the Systema Naturae (first published in 1735). His
scheme divides organisms into a series of taxonomic categories, starting with kingdom
and progressing down through phylum, class, order, family and genus to species. The
naturalists of the 18th and early 19th centuries likened this hierarchy to a ‘tree of life’, an
analogy that was adopted by Darwin (1859) in The Origin of Species as a means of
describing the interconnected evolutionary histories of living organisms. The
classificatory scheme devised by Linnaeus therefore became reinterpreted as a
phylogeny indicating not just the similarity between species but also their evolutionary
relationships.
reviews of phylogenetics (Saitou, 1996; Li, 1997; Swofford et al., 1996). An especially
concise introduction to molecular phylogenetics is provided by Hillis et al. (1993).
The danger of generating incorrect results is inherently greater in computational
phylogenetics than in many other fields of science. The events yielding a phylogeny
happened in the past and can only be inferred or estimated (with a few exceptions, Hillis
et al., 1994). Despite the well-documented limitations of available phylogenetic
procedures, current biological literature is replete with examples of conclusions derived
from the results of analyses in which data had been simply run through one or another
phylogeny program. More often than not, the limiting factor in phylogenetic analysis is
not the computational method used but the user's understanding of what the method is
actually doing with the data.
This brief guide to phylogenetic analysis has several objectives. First, a
conceptual approach that describes some of the most important principles underlying the
most widely and easily applied methods of phylogenetic analyses of biological
sequences and their interpretation will be introduced. The aim is to show that practical
phylogenetic analysis should be conceived as a search for a correct model, as much as
a search for the correct tree. In this context, some of the particular models assumed by
various popular methods and how these models might affect analysis of particular data
sets will be discussed. Finally, some examples of the application of particular methods to
the inferences of evolutionary history are provided.
Phylogenetic analysis programs are widely available at little or no cost. A
comprehensive list will not be given here since one has been published previously
(Swofford et al. 1996). The main ones in use are PHYLIP (phylogenetic inference
package) (Felsenstein 1989, 1996), available from Dr. J. Felsenstein at
www.evolution.genetics.washington.edu/phylip, and PAUP (phylogenetic analysis using
parsimony), available from Sinauer Associates, Sunderland, Massachusetts,
www.lms.si.edu/PAUP/. Current versions of these programs provide the three main
methods for phylogenetic analysis (parsimony, distance, and maximum likelihood)
and also include many types of evolutionary models for sequence variation.
Each program requires a particular type of input sequence format. Another program,
MacClade, is useful for detailed analysis of the predictions made by PHYLIP, PAUP, and
other phylogenetic programs and is also available from Sinauer
(www.phylogeny.arizona.edu/macclade/macclade). MacClade, as the name suggests,
runs on a Macintosh computer. PHYLIP and PAUP run on practically any machine, but
the user interface for PAUP has been most developed for use on the Macintosh
computer. There are also several Web sites that provide information on phylogenetic
relationships among organisms. There are several excellent descriptions of phylogenetic
analysis in which the methods are covered in considerable depth (Li and Graur 1991;
Miyamoto and Cracraft 1991; Felsenstein 1996; Li and Gu 1996; Saitou 1996; Swofford
et al. 1996; Li 1997).
The tree is composed of outer branches (or leaves) representing the taxa and
nodes and branches representing relationships among the taxa, illustrated as sequences
A–D in Figure 1. Thus, sequences A and B are derived from a common ancestor
sequence represented by the node below them, and C and D are similarly related. The
A/B and C/D common ancestors also share a common ancestor represented by a node
at the lowest level of the tree. It is important to recognize that each node in the tree
represents a splitting of the evolutionary path of the gene into two different species that
are isolated reproductively. Beyond that point, any further evolutionary changes in each
new branch are independent of those in the other new branch. The length of each
branch to the next node represents the number of sequence changes that occurred prior
to the next level of separation. Note that, in this example, the branch length between the
A/B node and A is approximately equal to that between the A/B node and B, indicating
the species are evolving at the same rate.
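The four-taxon topology described above can be written down compactly. The following is a minimal sketch, assuming a nested-tuple representation (an illustrative choice, not something taken from the figure) in which each internal node is a pair of children and each leaf is a taxon name. Branch lengths are omitted because the values in Figure 1 are not reproduced here.

```python
# Topology of the four-taxon example: A and B share one ancestor,
# C and D share another, and the two ancestors share the deepest node.
tree = (("A", "B"), ("C", "D"))


def leaves(node):
    """Return the taxa (leaves) descended from a node, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


def to_newick(node):
    """Serialize the topology in Newick notation, without branch lengths."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


ab_node, cd_node = tree
print(leaves(ab_node))        # taxa below the A/B common ancestor
print(to_newick(tree) + ";")  # the whole tree as a Newick string
```

Newick strings of this kind are the usual exchange format for trees between phylogeny programs.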
The amount of evolutionary time that has transpired since the separation of A
and B is usually not known. What is estimated by phylogenetic analysis is the amount of
sequence change between the A/B node and A, and also between the A/B node and B.
Hence, judging by the branch lengths from this node to A and B, the same number of
sequence changes has occurred. However, it is also likely that for some biological or
environmental reason unique to each species, one taxon may have undergone more
mutations since diverging from the ancestor than the other. In this case, different branch
lengths would be shown on the tree. Some types of phylogenetic analyses assume that
the rates of evolution in the tree branches are the same, whereas others assume that
they vary, as discussed below. The assumption of a uniform rate of mutation in the tree
branches is known as the molecular clock hypothesis and is usually most suitable for
closely related species (Li and Graur 1991; Li 1997). Tests for this hypothesis have been
There are additional assumptions that are defaults in some methods but can be
at least partially corrected for in others:
1. The sequences in the sample evolved according to a single stochastic
process.
2. All positions in the sequence evolved according to the same stochastic
process.
3. Each position in the sequence evolved independently.
Errors in published phylogenetic analyses can often be attributed to violations
of one or more of the foregoing assumptions. Every sequence data set must be
evaluated against these assumptions, with other possible explanations for the
observed results considered.
Figure 4. Rooted tree of life showing principal relationships among prokaryotic domains
Bacteria and Archaea (Woese 1987; Barns et al. 1996; Brown and Doolittle 1997).
Branch lengths are approximate only. Species that have been sequenced or are being
sequenced are shown.
Although these studies of rRNA sequences suggest a quite clear-cut model for
the evolution of life, phylogenetic analysis of other genes and gene families has revealed
that the situation is probably more complex and that a more appropriate model might be
the one shown in Figure 5. There are now many examples of horizontal or lateral
transfer of genes between species that introduce new genes and sequences into an
organism (Brown and Doolittle 1997; Doolittle 1999). These types of transfers are
inferred from the finding that the phylogenetic histories of different genes in an organism,
such as genes for metabolic functions, are not the same or that codon use in different
genes varies. Another type of phylogenetic analysis is based on the number of genes
shared between genomes and produces a tree that is similar to the rRNA tree (Snel et
al. 1999).
Figure 5: The reticulated or net-like form of the tree of life. Analysis of rRNA sequences
originally suggested three main branches in the tree of life, Archaea, Bacteria, and
Eukarya. Subsequent phylogenetic analysis of genes for some metabolic enzymes is not
congruent with the rRNA tree. Hence, for these metabolic genes, the tree has a
reticulated form due to horizontal transfer of these genes between species. (Martin
1999)
To track the evolutionary history of genes, more attention has also been paid to
the methodology of phylogenetic analysis and to the inherent errors in many of the
assumptions (Doolittle 1999). Problems associated with variations between rates of
change in different sites and of analyzing more distantly related sequences are
discussed below. Moreover, there is evidence that genomes undergo extensive
rearrangements, placing sequences of different evolutionary origin next to each other
and even causing rearrangements within protein-encoding genes (Henikoff et al. 1997).
The different regions of independent evolutionary origin in a sequence therefore
need to be identified. Proteins are modular with functional domains, sometimes repeated
within a protein and sometimes shared within a protein family. These regions are
identified by their sharing of significant sequence similarity. The remainder of the aligned
regions in the group may have variable levels of similarity. In nucleic acid sequences, a
given sequence pattern may provide a binding site for a regulatory molecule, leading to
promoter function, RNA splicing, or some other function. It may be difficult to decide the
extent of these patterns for phylogenetic analysis.
Tree-Building Methods:
Tree-building methods implemented in available software are discussed in detail
in the literature (Saitou, 1996; Swofford et al., 1996; Li, 1997) and described on the
Internet. This section briefly describes some of the most popular methods. Tree building
methods can be sorted into distance-based vs. character-based methods. Much of the
discussion in molecular phylogenetics dwells on the utility of distance and character-
based methods (e.g., Saitou, 1996; Li, 1997). Distance methods compute pairwise
distances according to some measure and then discard the actual data, using only the
fixed distances to derive trees. Character-based methods derive trees that optimize the
distribution of the actual data patterns for each character. Pairwise distances are,
therefore, not fixed, as they are determined by the tree topology. The most commonly
applied distance-based methods include neighbor-joining and the Fitch-Margoliash
method, and the most common character-based methods include maximum parsimony
and maximum likelihood.
1. The sequences chosen can be either DNA or protein sequence: Different programs and
program options are used for each type. RNA sequences are analyzed by covariation methods
and by analyzing changes in secondary structure. The selected sequences should align with each
other along their entire lengths, or else each should have a common set of patterns or domains
that provides a strong indication of evolutionary relatedness.
2. The alignment of the sequence pairs should not have a large number of gaps that are
obviously necessary to align identical or related characters. A phylogenetic analysis should only
be performed on parts of sequences that can be reasonably aligned. In general, phylogenetic
methods analyze conserved regions that are represented in all the sequences. The more similar
the sequences are to each other, the better. The simplest evolutionary models assume that the
variation in each column of the multiple sequence alignment represents single-step changes and
that no reversals (A → T → A) have occurred. As the observed variation increases, more
multiple-step changes (A → T → G) and reversions are likely to be present. Corrections may be
applied for such variation, thereby increasing the observed amount of change to a more
reasonable value. These corrections assume a uniform rate of change at all sequence positions
over time. Gaps in the multiple sequence alignment are usually not scored because there is no
suitable model for the evolutionary mechanisms that produce them.
3. This step selects sequences suitable for maximum parsimony analysis. Other
methods may also be used with these same sequences. For parsimony analysis, the best results
are obtained when the amount of variation among all pairs of sequences is similar (no very
different sequences are present) and when the amount of variation is small. Some columns in the
multiple sequence alignment will have the same residue in all sequences; other columns will
include both conserved and nonconserved residues. There should be a clear-cut majority of
certain residues in some columns of the alignment but also some variation. These more common
residues are taken to represent an earlier group of sequences from which others were derived. If
there is too much variation, there will be too many possible ancestral relationships. Because the
maximum parsimony method has to attempt to fit all possible trees to the data, the method is not
suitable for more than 11 or 12 sequences because there are too many trees to test. More than
one tree may be found to be equally parsimonious. A consensus tree representing the conserved
features of the different trees may then be produced.
4. This step selects sequences for phylogenetic analysis by distance
methods. Distance methods are able to predict an evolutionary tree when variation among the
sequences is present (some sequences are more alike than others) and when the amount of
variation is intermediate. The number of changed positions in an alignment between two
sequences divided by the total number of matched positions is the distance between the
sequences. As distances increase, corrections are necessary for deviations from single-step
changes between sequences. Of course, as distances increase, the uncertainty of alignments
also increases, and a reassessment of the suitability of the multiple sequence alignment method
may be necessary. Sequences with this type of variation may also be suitable for phylogenetic
analysis by maximum likelihood methods. Distance methods may be used with a large number of
sequences. The program CLUSTALW produces a distance-based tree at the same time as a
multiple sequence alignment (Higgins et al. 1996).
5. Maximum likelihood methods may be used for any set of related sequences, but they are
particularly useful when the sequences are more variable. These methods are computationally
intense, and computational complexity increases with the number of sequences since the
probability of every possible tree must be calculated as described in the text. An advantage of
these methods is that they provide evolutionary models to account for the variation in the
sequences.
6. The data in the multiple sequence alignment columns are resampled to test how well the
branches on the evolutionary tree are supported (bootstrapping).
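The uncorrected distance defined in item 4 (changed positions divided by matched positions) can be sketched as follows. This is an illustration only: it assumes that "matched positions" means alignment columns in which neither sequence has a gap, and the function name and example sequences are hypothetical.

```python
def p_distance(seq1, seq2, gap="-"):
    """Fraction of compared alignment columns at which two sequences differ.

    Columns containing a gap in either sequence are skipped (an assumption;
    the text only says gaps are usually not scored).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matched = 0
    changed = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # gapped column: not scored
        matched += 1
        if a != b:
            changed += 1
    if matched == 0:
        raise ValueError("no ungapped columns to compare")
    return changed / matched


# Two differences over eight compared columns.
print(p_distance("ACGTACGT", "ACGTTCGA"))  # 0.25
```

As the item notes, this raw value underestimates the true amount of change once distances grow, which is why corrections are applied.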
Character-Based Methods:
The character-based methods have little in common with each other, besides the
use of the character data at all steps in the analysis. This allows the assessment of the
reliability of each base position in an alignment on the basis of all other base positions.
than ML when the amount of sequence evolution since lineages diverged is much
greater than the amount of divergence that occurred between lineage splits (i.e., in a
tree with very long terminal branches and short internal internodes) (Huelsenbeck,
1995). This condition produces “long branch attraction”: the long branches become
artificially connected because the number of nonhomologous similarities the sequences
have accumulated exceeds the number of homologous similarities they have retained
with their true closest relatives (Swofford et al., 1996). Character weighting improves the
performance of MP under these conditions (Huelsenbeck, 1995).
Table 1: Example of phylogenetic analysis to find the correct unrooted tree from four
aligned sequences by the maximum parsimony method (Li and Graur 1991)
Figure 9: Example of phylogenetic analysis using the maximum parsimony method (Li
and Graur 1991). This method finds the tree that changes any sequence into all of the others
by the least number of steps.
Features:
• Multiple sequence alignment needed
• Used for rather similar sequences, analyzed in small numbers
• Time-consuming and computationally costly
• PHYLIP is widely used software that implements maximum parsimony and other
methods:
http://evolution.genetics.washington.edu/phylip.html
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest
number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements)
to explain the sequences.
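The optimality criterion can be made concrete by counting the minimum number of changes a given tree requires. The sketch below uses Fitch's set-intersection procedure, a standard way to score a fixed tree under parsimony; it is not necessarily what PHYLIP or PAUP do internally. The four aligned sequences are hypothetical, and the trees use a nested-tuple form with an arbitrary root, which does not affect the count.

```python
def fitch_column(node, states):
    """Return (possible ancestral states, changes) for one alignment column."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch_column(left, states)
    right_set, right_changes = fitch_column(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_changes + right_changes
    # Children disagree: one substitution must be placed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Total minimum number of changes over all columns for a fixed tree."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AAGT", "B": "AAGC", "C": "GCGT", "D": "GCGC"}

# The three possible unrooted topologies for four taxa.
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for tree in candidates:
    print(tree, parsimony_length(tree, alignment))
```

For these sequences the tree grouping A with B and C with D needs the fewest changes, so it would be the most parsimonious of the three.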
Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral,
homoplasy).
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”:
studies, ML has consistently outperformed ME and MP when the data analysis proceeds
according to the same model that generates the data (Huelsenbeck, 1995). ML will
always be the most computationally intensive method of all, however, so there will
always be situations in which it is not practical.
Features:
• Uses probability calculations to find a tree that best accounts for the observed
sequence variations.
• All possible trees are considered (time-consuming).
• Few sequences can be analyzed.
• It is possible to evaluate trees with mutations in different lineages.
• Uses evolutionary models of base substitution (e.g., Jukes-Cantor, Kimura).
Advantages:
• Are inherently statistical and evolutionary model-based.
• Usually the most ‘consistent’ of the methods available.
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
• Can help account for branch-length effects in unbalanced trees.
• Can be applied to nucleotide or amino acid sequences, and other types of data.
Disadvantages:
• Are not as simple and intuitive as many other methods.
• Are computationally very intense (limits number of taxa and length of sequence).
• Like parsimony, can be fooled by high levels of homoplasy.
• Violations of the assumed model can lead to incorrect trees.
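The cost of considering all possible trees can be illustrated by counting them. For n taxa, the number of distinct unrooted bifurcating trees is the product of the odd numbers 3 × 5 × ... × (2n − 5), a standard combinatorial result. The short sketch below (the function name is illustrative) shows how quickly this grows.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labeled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 12):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give more than two million topologies, which is why exhaustive evaluation is practical only for small numbers of sequences.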
Distance-Based Methods:
The distance method employs the number of changes between each pair in a
group of sequences to produce a phylogenetic tree of the group. The sequence pairs
that have the smallest number of sequence changes between them are termed
“neighbors.” On a tree, these sequences share a node or common ancestor position and
are each joined to that node by a branch. The goal of distance methods is to identify a
tree that positions the neighbors correctly and that also has branch lengths which
reproduce the original data as closely as possible. Finding the closest neighbors among
a group of sequences by the distance method is often the first step in producing a
multiple sequence alignment.
The distance method was pioneered by Feng and Doolittle, and a collection of
programs by these authors will produce both an alignment and a tree of a set of protein
sequences (Feng and Doolittle 1996). The program CLUSTALW uses the
neighbor-joining distance method as a guide to multiple sequence alignments. PAUP version 4
has options for performing a phylogenetic analysis by distance methods. The programs of
the PHYLIP package that perform a distance analysis automatically read in
sequences in the PHYLIP infile format and automatically produce a file called outfile
containing a distance table.
Distance-based methods use the amount of dissimilarity (the distance) between
two aligned sequences to derive trees. A distance method would reconstruct the true
tree if all genetic divergence events were accurately recorded in the sequence (Swofford
et al., 1996). However, divergence encounters an upper limit as sequences become
mutationally saturated. After one sequence of a diverging pair has mutated at a particular
site, subsequent mutations in either sequence cannot render the sites any more
“different.” In fact, subsequent mutations can make them equal again (for example, if a
valine mutates to an isoleucine, which mutates back to a valine). Therefore, most
distance-based methods correct for such “unseen” substitutions. In practice, application
of the rate matrix effectively presumes that some proportion of observed pairwise base
identities actually represents multiple mutations and that this proportion increases with
increasing overall sequence divergence. Some programs implement, at least optionally,
calculation of uncorrected distances, whereas, for example, the MEGA program (Kumar
et al., 1994) implements only uncorrected distances for codon and amino acid data.
Unless overall divergences are very low, the latter approach is virtually guaranteed to
give inaccurate results.
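One of the simplest corrections for unseen substitutions is the Jukes-Cantor formula for nucleotide sequences, d = −(3/4) ln(1 − (4/3) p), where p is the observed proportion of differing sites. The sketch below is offered as an illustration of the idea, not as the specific correction used by any program named here. It assumes equal base frequencies and equal rates for all substitutions, which is what the Jukes-Cantor model itself assumes.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site from the observed proportion p.

    The formula is undefined at p >= 0.75, the level expected between two
    unrelated (fully saturated) nucleotide sequences under this model.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for p in (0.05, 0.25, 0.50, 0.70):
    print(p, round(jukes_cantor_distance(p), 3))
```

The corrected value is always at least as large as p, and the gap widens as divergence increases. An observed 25% difference corresponds to about 0.304 substitutions per site, while values near saturation are corrected very steeply.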
Pairwise distance is calculated using maximum-likelihood estimators of
substitution rates. The most popular distance tree-building programs have a limited
number of substitution models, but PAUP 4.0 implements a number of models, including
the actual model estimated from the data using maximum likelihood, as well as the
logdet distance method.
Distance methods are much less computationally intensive than maximum
likelihood but can employ the same models of sequence evolution. This is their biggest
advantage. The disadvantage is that the actual character data are discarded. The most
commonly applied distance-based methods are the unweighted pair group method with
arithmetic mean (UPGMA), neighbor joining (NJ), and methods that optimize the
additivity of a distance tree, including the minimum evolution (ME) method. Several
methods are available in more than one phylogenetics software package but not all
implementations allow the same parameter specifications and/or tree optimization
features (e.g., branch swapping).
Figure 10. Star decomposition. This is how tree-building algorithms such as neighbor
joining work. The most similar terminals are joined, and a branch is inserted between
them and the remainder of the star. Subsequently, the new branch is consolidated so
that its value is a mean of the two original values, yielding a star tree with n-1 terminals.
The process is repeated until only one terminal remains.
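A single joining step of this kind can be sketched as follows. The code uses the neighbor-joining selection criterion in its commonly stated form: pick the pair minimizing Q(i, j) = (n − 2) d(i, j) − r(i) − r(j), where r is the sum of a taxon's distances to all the others. The five-taxon distance matrix, the function name, and the dictionary layout are illustrative assumptions.

```python
from itertools import combinations


def nj_step(names, dist):
    """Perform one neighbor-joining step.

    names: list of current terminals.
    dist: dict mapping frozenset({x, y}) to the distance between x and y.
    Returns the joined pair, their branch lengths to the new node,
    the reduced list of terminals, and the reduced distance table.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three terminals")

    def d(x, y):
        return 0.0 if x == y else dist[frozenset((x, y))]

    # Sum of distances from each terminal to all the others.
    total = {x: sum(d(x, y) for y in names) for x in names}

    # Choose the pair with the smallest Q value.
    i, j = min(
        combinations(names, 2),
        key=lambda p: (n - 2) * d(p[0], p[1]) - total[p[0]] - total[p[1]],
    )

    # Branch lengths from i and j to the new internal node.
    len_i = 0.5 * d(i, j) + (total[i] - total[j]) / (2 * (n - 2))
    len_j = d(i, j) - len_i

    # Replace i and j by the new node and recompute its distances.
    new = "(" + i + "," + j + ")"
    rest = [x for x in names if x not in (i, j)]
    new_dist = {frozenset((x, y)): d(x, y) for x, y in combinations(rest, 2)}
    for k in rest:
        new_dist[frozenset((new, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))
    return (i, j), (len_i, len_j), rest + [new], new_dist


# Illustrative distance matrix for five terminals.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {
    frozenset((names[r], names[c])): rows[r][c]
    for r in range(5)
    for c in range(r + 1, 5)
}

pair, lengths, new_names, new_dist = nj_step(names, dist)
print(pair, lengths, new_names)
```

Repeating the step on the reduced table until three terminals remain, then joining those to a single node, yields the full tree. Note that the pair chosen is not simply the closest pair in the matrix (here d and e), because the criterion also accounts for how far each terminal is from all the others.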
Fitch-Margoliash (FM):
The Fitch-Margoliash (FM) method seeks to maximize the fit of the observed
pairwise distances to a tree by minimizing the squared deviation of all possible observed
distances relative to all possible path lengths on the tree (Felsenstein, 1997). There are
several variations that differ in how the error is weighted. The variance estimates are not
completely independent because errors in all the internal tree branches are counted at
least twice (Rzhetsky and Nei, 1992).
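The fit criterion can be written as a weighted sum of squared differences between each observed distance D and the corresponding path length d on the tree. The sketch below assumes the weighting 1/D², the form usually attributed to Fitch and Margoliash. Setting the exponent to 0 gives the unweighted variant. The function name and the three-taxon values are hypothetical.

```python
def least_squares_fit(observed, path_lengths, power=2):
    """Sum over all pairs of (D - d)**2 / D**power.

    observed: dict mapping a pair of taxa to the observed distance D.
    path_lengths: dict mapping the same pairs to the tree path length d.
    power: 2 for the Fitch-Margoliash weighting assumed here, 0 for none.
    """
    total = 0.0
    for pair, d_obs in observed.items():
        d_tree = path_lengths[pair]
        total += (d_obs - d_tree) ** 2 / d_obs ** power
    return total


# Hypothetical observed distances and path lengths on a candidate tree.
observed = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
on_tree = {("A", "B"): 2.0, ("A", "C"): 5.0, ("B", "C"): 3.0}

print(least_squares_fit(observed, on_tree))           # weighted
print(least_squares_fit(observed, on_tree, power=0))  # unweighted
```

A tree search under this criterion looks for the topology and branch lengths giving the smallest value. The weighting makes errors on large distances count less than the same absolute error on small ones.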
the nucleotide data under a more realistic model might be preferable to MEGA’s
methods.
Simulation studies indicate that UPGMA performs poorly over a broad range of
tree shape space (Huelsenbeck, 1995). The use of this method is not recommended; it
is mentioned here only because its application seems to persist, as evidenced by
UPGMA gene trees appearing in publications (Huelsenbeck, 1995).
NJ is clearly the fastest procedure and generally yields a tree close to the ME
tree (Rzhetsky and Nei, 1992; Li, 1997). However, it yields only one tree. Depending on
the structure of the data, numerous different trees might be as good as or significantly
better than the NJ tree (Swofford et al., 1996).
most problems in molecular biology, which can be solved by acquiring more data, long
branch attractions are much more complex. If longer sequences are used when long
branch attractions are present, the incorrect solution will be even more strongly
supported. Of the three pitfalls, alignment artefacts are potentially the most serious,
because even if the second and third problems are solved, the misalignments can still
produce incorrect trees. A new algorithm, paralinear (logdet) distances (Lockhart
et al. 1994; Lake 1994), provides a simple but rigorous mathematical solution
for the third pitfall. This particular algorithm is now available in some of the phylogenetic
packages described below. For a discussion of many other useful algorithms that are
available, including maximum parsimony, maximum likelihood and other distance
methods, see Stewart (1993).
Tree Evaluation:
Several procedures are available that evaluate the phylogenetic signal in the
data and the robustness of trees (Swofford et al., 1996; Li, 1997). The most popular of
the former class are tests of data signal versus randomized data (skewness and
permutation tests). The latter class includes tests of tree support from resampling of
observed data (nonparametric bootstrap). The likelihood ratio test provides a means of
evaluating both the substitution model and the tree.
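The nonparametric bootstrap mentioned above resamples alignment columns with replacement, rebuilds a tree from each pseudo-alignment, and reports how often each grouping recurs. The sketch below shows only the resampling and tallying. The tree-building step is passed in as a caller-supplied function (called `build_clades` here, a hypothetical placeholder) that returns the set of groupings found in a tree.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, build_clades, n_replicates=100, seed=0):
    """Fraction of replicates in which each clade appears.

    build_clades: hypothetical caller-supplied function that takes an
    alignment and returns an iterable of clades (e.g. frozensets of taxa).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for clade in build_clades(replicate):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / n_replicates for clade, n in counts.items()}


# Hypothetical alignment; one pseudo-replicate of the same size.
aln = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
print(bootstrap_alignment(aln, random.Random(0)))
```

Because whole columns are drawn, each resampled site keeps the same pattern across taxa that it had in the original alignment; only the frequency of the patterns changes between replicates.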
methods provide the same prediction, confidence in the prediction is much higher.
Another recommendation is to pay careful attention to the evolutionary assumptions and
models that are used for both sequence alignment and tree construction (Li and Graur
1991; Swofford et al. 1996; Li 1997).
Reference:
Book reference:
Baxevanis A.D. and Ouellette B.F.F. 2001. Bioinformatics: A Practical Guide
to the Analysis of Genes and Proteins. Wiley.
Brown T.A. 2002. Genomes 2. Wiley-Liss.
Mount D.W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press.
Journal reference:
Adachi J. and Hasegawa M. 1996. MOLPHY Version 2.3. Programs for molecular
phylogenetics based on maximum likelihood. Institute of Statistical Mathematics, Tokyo.
Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal
diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl.
Acad. Sci. 93: 9188–9193.
Brown J.R. and Doolittle W.F. 1997. Archaea and the prokaryote-to-eukaryote transition.
Microbiol. Mol. Biol. Rev. 61: 456–502.
Comeron J.M. and Kreitman M. 1998. The correlation between synonymous and
nonsynonymous substitutions in Drosophila: Mutation, selection or relaxed constraints?
Genetics 150: 767–775.
Darwin C. 1859. The Origin of Species by Means of Natural Selection, or the Preservation
of Favoured Races in the Struggle for Life. Penguin Books, London.
Doolittle W.F. 1999. Phylogenetic classification and the universal tree. Science 284:
2124–2128.
Eernisse D.J. 1998. A brief guide to phylogenetic software. Trends Genet. 14: 473–475.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood
approach. J. Mol. Evol. 17: 368–376.
———. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu.
Rev. Genet. 22: 521–565.
———. 1989. PHYLIP: Phylogeny inference package (version 3.2). Cladistics 5: 164–166.
———. 1996. Inferring phylogenies from protein sequences by parsimony, distance, and
likelihood methods. Methods Enzymol. 266: 418–427.
Feng D.F. and Doolittle R.F. 1996. Progressive alignment of amino acid sequences and
construction of phylogenetic trees from them. Methods Enzymol. 266: 368–382.
Fitch W.M. 1981. A non-sequential method for constructing trees and hierarchical
classifications. J. Mol. Evol. 18: 30–37.
Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155:
279–284.
Hein J. and Støvlbæk J. 1996. Combined DNA and protein alignment. Methods
Enzymol. 266: 402–418.
Henikoff S., Greene E.A., Pietrokovski S., Bork P., Attwood T.K., and Hood L. 1997.
Gene families: The taxonomy of protein paralogs and chimeras. Science 278: 609–614.
Hennig W. 1966. Phylogenetic Systematics. University of Illinois Press, Urbana, IL.
Hershkovitz M.A. and Lewis L.A. 1996. Deep-level diagnostic value of the rDNA ITS
region. Mol. Biol. Evol. 13: 1276–1295.
Higgins D.G., Thompson J.D., and Gibson T.J. 1996. Using CLUSTAL for multiple
sequence alignments. Methods Enzymol. 266: 383–402.
Hillis D.M., Allard M.W., and Miyamoto M.M. 1993. Analysis of DNA sequence
data: Phylogenetic inference. Methods Enzymol. 224: 456–487.
Hillis D.M. 1997. Biology recapitulates phylogeny. Science 276: 218–219.
Hillis D.M. and Bull J.J. 1993. An empirical test of bootstrapping as a method for
assessing confidence in phylogenetic analysis. Syst. Biol. 42: 182–192.
Hillis D.M., Huelsenbeck J.P., and Cunningham C.W. 1994. Application and
accuracy of molecular phylogenies. Science 264: 671–677.
Huelsenbeck J.P. 1995. Performance of phylogenetic methods in simulation. Syst.
Biol. 44: 17–48.
Huelsenbeck J.P., Hillis D.M., and Jones R. 1996. Parametric bootstrapping in
molecular phylogenetics. In Molecular Zoology: Advances, Strategies, and Protocols
(ed. J.D. Ferraris and S.R. Palumbi), pp. 19–45. Wiley-Liss, New York.
Kumar S., Tamura K., and Nei M. 1994. MEGA: Molecular Evolutionary Genetics
Analysis software for microcomputers. Comput. Appl. Biosci. 10: 189–191.
Lake J.A. 1994. Proc. Natl. Acad. Sci. U.S.A. 91: 1455–1459.
Li W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous
substitution. J. Mol. Evol. 36: 96–99.
———. 1997. Molecular evolution. Sinauer Associates, Sunderland, Massachusetts.
Li W.-H. and Graur D. 1991. Fundamentals of molecular evolution, pp. 106–111. Sinauer
Associates, Sunderland, Massachusetts.
Li W.-H. and Gu X. 1996. Estimating evolutionary distances between DNA sequences.
Methods Enzymol. 266: 449–459.
Li W.-H., Wu C.I., and Luo C.C. 1985. A new method for estimating synonymous and
nonsynonymous rates of nucleotide substitution considering the relative likelihood of
nucleotide and codon changes. Mol. Biol. Evol. 2: 150–174.
Lockhart P.J. et al. 1994. Mol. Biol. Evol. 11: 605–612.
Maddison W.P. and Maddison D.R. 1992. MacClade: Analysis of phylogeny and
character evolution (version 3). Sinauer Associates, Sunderland, Massachusetts.
Martin W. 1999. Mosaic bacterial chromosomes: A challenge en route to a tree of
genomes. Bioessays 21: 99–104.
Mayr E. 1998. Two empires or three? Proc. Natl. Acad. Sci. 95: 9720–9723.
McDonald J.H. and Kreitman M. 1991. Adaptive protein evolution at the Adh locus in
Drosophila. Nature 351: 652–654.
Miyamoto M.M. and Cracraft J. 1991. Phylogenetic analysis of DNA sequences. Oxford
University Press, New York.
Nielsen R. and Yang Z. 1998. Likelihood models for detecting positively selected amino
acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.
Nuttall G.H.F. 1904. Blood Immunity and Blood Relationship. Cambridge University
Press, Cambridge.
Rzhetsky A. and Nei M. 1992. A simple method for estimating and testing
minimum-evolution trees. Mol. Biol. Evol. 9: 945–967.
Rzhetsky A. and Nei M. 1994. METREE: A program package for inferring and testing
minimum-evolution trees. Comput. Appl. Biosci. 10: 409–412.
Saitou N. 1996. Reconstruction of gene trees from sequence data. Methods Enzymol.
266: 427–449.
Saitou N. and Nei M. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Schadt E.E., Sinsheimer J.S., and Lange K. 1998. Computational advances in maximum
likelihood methods for molecular phylogeny. Genome Res. 8: 222–233.
Snel B., Bork P., and Huynen M.A. 1999. Genome phylogeny based on gene content.
Nat. Genet. 21: 108–110.