
Review work on

Phylogenetic Analysis - a bioinformatics tool

M.Sc. 2nd SEMESTER


ROLL: VU/PG/MIC II-S, NO: 03
REG.NO: 161515 YEAR: 2003-04
Department of Microbiology.
Vidyasagar University.
Midnapore.
Acknowledgement

This review paper would not have been possible without the help and encouragement
of many individuals. It gives me tremendous pleasure to acknowledge the valuable
assistance extended to me by various people in the successful completion of this program.
First and foremost, I wish to express my profound thanks to our respected teachers Prof.
Bikash Ranjan Pati, Head of the Dept. of Microbiology, Vidyasagar University, and Dr. Keshab
Ch. Mondal and Dr. Debdulal Banerjee, Lecturers, Dept. of Microbiology, Vidyasagar University,
for their constant encouragement and valuable guidance and suggestions.
I also express my thanks to Mr. Pradeep DasMahapatra and Mr. Jhereswer
Chalak, Laboratory Assistant of the Dept. of Microbiology, Vidyasagar University.
I would like to express my special thanks to Mr. Paltu Dhal, Scholar, IIT Kharagpur,
for his valuable advice and support.
I must express my deepest regard to my parents for their continuous blessing and
support.
Last but not least, I would like to thank all of my friends and my department, whose
timely support made it possible to complete this seminar report successfully.

..............................................................

Uttam Kumar Patra


M.Sc. 2nd Sem.
Dept. of Microbiology.
Vidyasagar University.
CONTENT

Abstract
Introduction
Basic Terminology in Phylogenetic Tree
The Concept of Evolutionary Trees
Fundamental Elements of Phylogenetic Models
Tree Interpretation - the Importance of Identifying Paralogs and Orthologs
Genome Complexity and Phylogenetic Analysis
Phylogenetic Data Analysis
Relationship of Phylogenetic Analysis to Sequence Alignment
Tree-Building Methods
Outline of Tree-Building Method
Character-Based Methods
• Maximum Parsimony Method
• Maximum Likelihood Method
Distance-Based Methods
• Unweighted Pair Group Method With Arithmetic Mean (UPGMA)
• Neighbor Joining (NJ)
• Fitch-Margoliash (FM)
• Minimum Evolution (ME)
• Which Distance-Based Tree-Building Procedure Is Best?
The Difference between Distance, Parsimony, and Maximum Likelihood Methods
Pitfalls of Phylogenetic Analysis
Tree Evaluation
Reliability of Phylogenetic Predictions
Complications from Phylogenetic Analysis
Reference
Phylogenetic analysis- a bioinformatics tool

Abstract:
Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is
the means of inferring or estimating these relationships. The evolutionary history inferred
from phylogenetic analysis is usually depicted as branching, treelike diagrams that
represent an estimated pedigree of the inherited relationships among molecules,
organisms, or both. The phylogenetic relationship between two or more sets of
sequences is often extremely important information for bioinformatics analyses such as
the construction of sequence alignments. It is not unreasonable to think of molecular
data as a kind of historical document that contains within it evidence
of the important steps in the evolution of a gene. The very same evolutionary events
(substitutions, insertions, deletions, and rearrangements) that are important to the
history of a gene can also be used to resolve questions about the evolutionary history
of, and relationships between, entire species. In fact, the phylogenetic relationships among
many kinds of organisms are difficult to determine in any other way. The true
relationship between homologous sequences is hardly ever known outside
computer simulation experiments. A variety of approaches are available for inferring the
most likely phylogenetic relationship between genes and species using nucleotide and
protein sequence information. Bioinformatics thus provides commonly used, statistically
based methods for inferring evolutionary relationships.

Introduction:
Phylogenetic analysis of nucleic acid and protein sequences is presently and will
continue to be an important area of sequence analysis. In addition to analyzing changes
that have occurred in the evolution of different organisms, the evolution of a family of
sequences may be studied. On the basis of the analysis, sequences that are the most
closely related can be identified by their occupying neighboring branches on a tree.
When a gene family is found in an organism or group of organisms, phylogenetic
relationships among the genes can help to predict which ones might have an equivalent
function. These functional predictions can then be tested by genetic experiments.
Phylogenetic analysis may also be used to follow the changes occurring in a rapidly
changing species, such as a virus. Analysis of the types of changes within a population
can reveal, for example, whether or not a particular gene is under selection (McDonald
and Kreitman 1991; Comeron and Kreitman 1998; Nielsen and Yang 1998), an important
source of information in applications like epidemiology.
Molecular phylogenetics predates DNA sequencing by several decades. It is
derived from the traditional method for classifying organisms according to their
similarities and differences, as first practiced in a comprehensive fashion by Linnaeus in
the 18th century. Linnaeus was a systematist, not an evolutionist, his objective being to
place all known organisms into a logical classification which he believed would reveal
the great plan used by the Creator, the Systema Naturae (first published in 1735). His
scheme divides organisms into a series of taxonomic categories, starting with kingdom
and progressing down through phylum, class, order, family and genus to species. The
naturalists of the 18th and early 19th centuries likened this hierarchy to a ‘tree of life’, an
analogy that was adopted by Darwin (1859) in The Origin of Species as a means of
describing the interconnected evolutionary histories of living organisms. The
classificatory scheme devised by Linnaeus therefore became reinterpreted as a
phylogeny indicating not just the similarity between species but also their evolutionary
relationship.


Whether the objective is to construct a classification or to infer a phylogeny, the
relevant data are obtained by examining variable characters in the organisms being
compared. Originally, these characters were morphological features, but molecular data
were introduced at a surprisingly early stage. In 1904 Nuttall used immunological tests to
deduce relationships between a variety of animals, one of his objectives being to place
humans in their correct evolutionary position relative to other primates. Nuttall’s work
showed that molecular data can be used in phylogenetics, but the approach was not
widely adopted until the late 1950s, the delay being due largely to technical limitations,
but also partly because classification and phylogenetics had to undergo their own
evolutionary changes before the value of molecular data could be fully appreciated.
These changes came about with the introduction of phenetics (Michener and Sokal,
1957) and cladistics (Hennig, 1966), two novel phylogenetic methods which, although
quite different in their approach, both place emphasis on large datasets that can be
analyzed by rigorous mathematical procedures. The difficulty in obtaining large
mathematical datasets when morphological characters are used was one of the main
driving forces behind the gradual shift towards molecular data.
The sequences of protein and DNA molecules provide the most detailed and
unambiguous data for molecular phylogenetics, but techniques for protein sequencing
did not become routine until the late 1960s, and rapid DNA sequencing was not
developed until 10 years after that.
By the end of the 1960s the earlier indirect methods, such as immunological tests, had
been supplemented with an increasing number of protein sequence studies (Fitch and
Margoliash, 1967), and during the 1980s DNA-based phylogenetics began to be carried
out on a large scale. Protein sequences are still used today in some contexts, but DNA
has now become by far the predominant molecule. This is mainly because DNA yields
more phylogenetic information than protein: the nucleotide sequences of a pair of
homologous genes have a higher information content than the amino acid sequences of
the corresponding proteins, because mutations that result in synonymous changes alter
the DNA sequence but do not affect the amino acid sequence. Entirely novel information
can also be obtained by DNA sequence analysis because variability in both coding and
non-coding regions of the genome can be examined. The ease with which DNA samples
for sequencing can be prepared by PCR is another key reason behind the predominance
of DNA in modern molecular phylogenetics.
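The point about synonymous changes can be made concrete with a small sketch. The Python fragment below is only an illustration and is not taken from any of the cited works: it builds the standard genetic code and compares two made-up coding sequences (`gene_1` and `gene_2` are hypothetical), counting differences at the DNA level and at the protein level.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


# Hypothetical homologous coding sequences (three codons each).
gene_1 = "GCTAAAGGT"  # Ala-Lys-Gly
gene_2 = "GCCAAGGAT"  # Ala-Lys-Asp

dna_differences = count_differences(gene_1, gene_2)
protein_differences = count_differences(translate(gene_1), translate(gene_2))
```

For these two hypothetical sequences three nucleotide positions differ but only one amino acid does, so two of the three changes would be invisible in a comparison of the proteins.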
Procedures for phylogenetic analysis are strongly linked to those for sequence
alignment and similar difficulties are encountered. Just as two very similar sequences
can be easily aligned even by eye, a group of sequences that are very similar but with a
small level of variation throughout can easily be organized into a tree. Conversely, as
sequences become more and more different through evolutionary change, they can be
much more difficult to align. A phylogenetic analysis of very different sequences is also
difficult to do because there are so many possible evolutionary paths that could have
been followed to produce the observed sequence variation. Because of the complexity of
this problem, considerable expertise is required for difficult situations.
Macromolecules, especially sequences, have surpassed morphological and other
organismal characters as the most popular form of data for phylogenetic or cladistic
analysis. It is unrealistic to believe that an all-purpose phylogenetic analysis recipe can
be delineated (Hillis et al., 1993). Although numerous phylogenetic algorithms,
procedures, and computer programs have been devised, their reliability and practicality
are, in all cases, dependent on the structure and size of the data. The merits and pitfalls
of various methods are the subject of often acrimonious debates in taxonomic and
phylogenetic journals. Some of these debates are summarized in a series of useful
reviews of phylogenetics (Saitou, 1996; Li, 1997; Swofford et al., 1996). An especially
concise introduction to molecular phylogenetics is provided by Hillis et al. (1993).
The danger of generating incorrect results is inherently greater in computational
phylogenetics than in many other fields of science. The events yielding a phylogeny
happened in the past and can only be inferred or estimated (with a few exceptions, Hillis
et al., 1994). Despite the well-documented limitations of available phylogenetic
procedures, current biological literature is replete with examples of conclusions derived
from the results of analyses in which data had been simply run through one or another
phylogeny program. Occasionally, the limiting factor in phylogenetic analysis is not so
much the computational method used; more often than not, the limiting factor is the
users’ understanding of what the method is actually doing with the data.
This brief guide to phylogenetic analysis has several objectives. First, a
conceptual approach that describes some of the most important principles underlying the
most widely and easily applied methods of phylogenetic analyses of biological
sequences and their interpretation will be introduced. The aim is to show that practical
phylogenetic analysis should be conceived as a search for a correct model, as much as
a search for the correct tree. In this context, some of the particular models assumed by
various popular methods and how these models might affect analysis of particular data
sets will be discussed. Finally, some examples of the application of particular methods to
the inferences of evolutionary history are provided.
Phylogenetic analysis programs are widely available at little or no cost. A
comprehensive list will not be given here since one has been published previously
(Swofford et al. 1996). The main ones in use are PHYLIP (phylogenetic inference
package) (Felsenstein 1989, 1996), available from Dr. J. Felsenstein at
www.evolution.genetics.washington.edu/phylip, and PAUP (phylogenetic analysis using
parsimony), available from Sinauer Associates, Sunderland, Massachusetts,
www.lms.si.edu/PAUP/. Current versions of these programs provide the three main
methods for phylogenetic analysis (parsimony, distance, and maximum likelihood
methods) and also include many types of evolutionary models for sequence variation.
Each program requires a particular type of input sequence format. Another program,
MacClade, is useful for detailed analysis of the predictions made by PHYLIP, PAUP, and
other phylogenetic programs and is also available from Sinauer
(www.phylogeny.arizona.edu/macclade/macclade). MacClade, as the name suggests,
runs on a Macintosh computer. PHYLIP and PAUP run on practically any machine, but
the user interface for PAUP has been most developed for use on the Macintosh
computer. There are also several Web sites that provide information on phylogenetic
relationships among organisms. There are several excellent descriptions of phylogenetic
analysis in which the methods are covered in considerable depth (Li and Graur 1991;
Miyamoto and Cracraft 1991; Felsenstein 1996; Li and Gu 1996; Saitou 1996; Swofford
et al. 1996; Li 1997).


Basic terminology in phylogenetic tree:

Root: the common ancestor of all taxa.
Branch: reflects the relationship between taxa according to descent and ancestry.
Branch length: indicates the number of changes that have occurred in the branch.
Node: a taxonomic unit representing either an existing or an extinct species. It is a
bifurcating branch point.
Clade: a subgroup of two or more taxa or DNA/protein sequences that includes their
most recent common ancestor and all of the descendants of that ancestor. Clade is
derived from the Greek word ‘klados’, meaning branch or twig.
Distance scale: scale that represents the number of differences between organisms
or sequences.
Topology: defines the branching pattern of the tree.
Taxon: any named group of organisms, not necessarily a clade.

The Concept of Evolutionary Trees:


An evolutionary tree is a two-dimensional graph showing evolutionary
relationships among organisms or, in the case of sequences, among certain genes from
separate organisms. The separate sequences are referred to as taxa (singular taxon),
defined as phylogenetically distinct units on the tree. Trees may be rooted or
unrooted. A rooted tree corresponds, more or less, to everyone’s idea of a
tree. Typically, the ancestral state of the organisms, or genes, being studied is shown
at the bottom of the tree, and the tree branches, or bifurcates, until it reaches the
terminal branches (or tips, or leaves) at the top of the tree. There is nothing sacred about
this convention, however, and trees can also be drawn with the tips at the bottom, at the
left, at the right or even at a 45° angle. An unrooted tree is a less intuitive, more abstract
concept. Unrooted trees represent the branching order, but do not indicate the root, or
location, of the last common ancestor. Ideally, rooted trees are preferable, but, in
practice, virtually every phylogenetic reconstruction algorithm provides an unrooted
tree; thus, one needs to become familiar with them.

Figure 1: Structure of evolutionary trees.

The tree is composed of outer branches (or leaves) representing the taxa and
nodes and branches representing relationships among the taxa, illustrated as sequences
A–D in Figure 1. Thus, sequences A and B are derived from a common ancestor
sequence represented by the node below them, and C and D are similarly related. The
A/B and C/D common ancestors also share a common ancestor represented by a node
at the lowest level of the tree. It is important to recognize that each node in the tree
represents a splitting of the evolutionary path of the gene into two different species that
are isolated reproductively. Beyond that point, any further evolutionary changes in each
new branch are independent of those in the other new branch. The length of each
branch to the next node represents the number of sequence changes that occurred prior
to the next level of separation. Note that, in this example, the branch length between the
A/B node and A is approximately equal to that between the A/B node and B, indicating
the species are evolving at the same rate.
The amount of evolutionary time that has transpired since the separation of A
and B is usually not known. What is estimated by phylogenetic analysis is the amount of
sequence change between the A/B node and A, and also between the A/B node and B.
Hence, judging by the branch lengths from this node to A and B, the same number of
sequence changes has occurred. However, it is also likely that for some biological or
environmental reason unique to each species, one taxon may have undergone more
mutations since diverging from the ancestor than the other. In this case, different branch
lengths would be shown on the tree. Some types of phylogenetic analyses assume that
the rates of evolution in the tree branches are the same, whereas others assume that
they vary, as discussed below. The assumption of a uniform rate of mutation in the tree
branches is known as the molecular clock hypothesis and is usually most suitable for
closely related species (Li and Graur 1991; Li 1997). Tests for this hypothesis have been
devised as described below. Even if there is a common rate of evolutionary change,
statistical variations from one branch to another can influence the analysis. The number
of substitutions in each branch is generally assumed to vary according to the Poisson
distribution, and the rate of change is assumed to be equal across all sequence
positions (Swofford et al. 1996).
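Under the Poisson assumption just described, the probability of observing k substitutions on a branch whose expected number of substitutions is λ is P(k) = e^(-λ) λ^k / k!. The short Python sketch below evaluates this expression; the function name and the example value of five expected substitutions are illustrative assumptions, not values from the cited sources.

```python
import math


def poisson_pmf(k, expected):
    """Probability of exactly k substitutions when `expected` are expected."""
    if k < 0 or expected < 0:
        raise ValueError("k and expected must be non-negative")
    return math.exp(-expected) * expected ** k / math.factorial(k)


# Hypothetical branch on which five substitutions are expected.
expected_substitutions = 5.0
prob_exactly_five = poisson_pmf(5, expected_substitutions)
prob_three_or_fewer = sum(poisson_pmf(k, expected_substitutions) for k in range(4))
```

With five expected substitutions, the chance of observing exactly five is only about 18%, which illustrates why two branches evolving at the same underlying rate will often show different numbers of changes.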
An alternative representation of the relationships among sequences A–D in
Figure 1A is shown in Figure 1B. The difference between the tree in A and that in B is that
the tree in B is unrooted. The unrooted tree also shows the evolutionary relationships
among sequences A–D, but it does not reveal the location of the oldest ancestry. B
could be converted into A by placing another node and adjoining root to the black line. A
root could also be placed anywhere else in the tree. Hence, there are a great many more
possibilities for rooted than for unrooted trees for a given number of taxa or sequences.
The tree shown is only one of many, each predicting a different evolutionary
relationship among the sequences or taxa. The number of possible rooted trees
increases very rapidly with the number of sequences or taxa. A root has been placed at
this position indicating that in this evolutionary model of the sequences this basal node is
the common ancestor of all of the other sequences. A unique path leads from the root
node to any other node, and the direction of the path indicates the passage of
evolutionary time. The root is defined by including a taxon that we are reasonably sure
branched off earlier than the other taxa under study but should be related to the
remaining taxa. It is also possible to predict a root, assuming that the molecular clock
hypothesis holds.
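How quickly the number of candidate trees grows can be checked with the standard counting formulas for strictly bifurcating trees. These formulas are not given in the text above and are stated here as background: n taxa can be arranged in (2n-5)!! unrooted trees and (2n-3)!! rooted trees, where !! denotes the product of the odd integers up to that value. A minimal Python sketch, with illustrative function names:

```python
def odd_double_factorial(m):
    """Product 1 * 3 * 5 * ... * m for odd m; returns 1 when m <= 1."""
    result = 1
    for factor in range(3, m + 1, 2):
        result *= factor
    return result


def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("an unrooted tree needs at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Number of distinct rooted bifurcating trees for n_taxa labeled tips."""
    if n_taxa < 2:
        raise ValueError("a rooted tree needs at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


# Tabulate (taxa, unrooted, rooted) for a few small data sets.
tree_counts = [(n, count_unrooted_trees(n), count_rooted_trees(n)) for n in range(3, 11)]
```

Four taxa give 3 unrooted and 15 rooted trees; ten taxa already give 2,027,025 unrooted and 34,459,425 rooted trees, which is why exhaustive evaluation of every tree quickly becomes impractical.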
The sum of all the branch lengths in a tree is referred to as the tree length. The
tree is also a bifurcating or binary tree, in that only two branches emanate from each
node. This situation is what one would expect during evolution—only one splitting away
of a new species at a time. Trees can have more than two branches emanating from a
node if the events separating taxa are so close that they cannot be resolved, or to
simplify the tree.
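Tree length and the bifurcating property are both easy to express once a tree is stored as nested nodes. The sketch below is one possible representation, not the data structure of any particular program; the `Node` class and the branch lengths are hypothetical, and the topology is the ((A,B),(C,D)) arrangement discussed above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    branch_length: float = 0.0  # length of the branch leading to this node
    children: list = field(default_factory=list)


def tree_length(node):
    """Sum of all branch lengths in the subtree rooted at `node`."""
    return node.branch_length + sum(tree_length(child) for child in node.children)


def is_bifurcating(node):
    """True if every internal node has exactly two descendant branches."""
    if not node.children:
        return True
    return len(node.children) == 2 and all(is_bifurcating(c) for c in node.children)


# Rooted tree ((A,B),(C,D)) with made-up branch lengths.
ab = Node("A/B", 1.0, [Node("A", 2.0), Node("B", 2.0)])
cd = Node("C/D", 1.5, [Node("C", 1.0), Node("D", 3.0)])
root = Node("root", 0.0, [ab, cd])

# An unresolved node (polytomy) with three descendant branches.
unresolved = Node("root", 0.0, [Node("A", 1.0), Node("B", 1.0), Node("C", 1.0)])
```

Here `tree_length(root)` adds up the six branches, and `is_bifurcating` distinguishes the fully resolved tree from the unresolved one.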


Figure 2: Possible kinds of phylogenetic tree.

Fundamental Elements of Phylogenetic Models:


Phylogenetic tree-building methods presume particular evolutionary models. For
a given data set, these models can be violated because of occurrences such as the
transfer of genetic material between organisms. Thus, when interpreting a given
analysis, one should always consider the model used and its assumptions and entertain
other possible explanations for the observed results.
Models inherent in phylogenetics methods make additional ‘‘default’’
assumptions:
1. The sequence is correct and originates from the specified source.
2. The sequences are homologous (i.e., are all descended in some way from a
shared ancestral sequence).
3. Each position in a sequence alignment is homologous with every other in
that alignment.
4. Each of the multiple sequences included in a common analysis has a common
phylogenetic history with the others (e.g., there are no mixtures of nuclear and organellar
sequences).
5. The sampling of taxa is adequate to resolve the problem of interest.
6. Sequence variation among the samples is representative of the broader group
of interest.
7. The sequence variability in the sample contains phylogenetic signal adequate
to resolve the problem of interest.


There are additional assumptions that are defaults in some methods but can be
at least partially corrected for in others:
1. The sequences in the sample evolved according to a single stochastic
process.
2. All positions in the sequence evolved according to the same stochastic
process.
3. Each position in the sequence evolved independently.
Errors in published phylogenetic analyses can often be attributed to violations
of one or more of the foregoing assumptions. Every sequence data set must be
evaluated against these assumptions, with other possible explanations for the
observed results considered.

Tree Interpretation - The Importance of Identifying Paralogs and Orthologs:
As more genomes are sequenced, we are becoming more interested in learning
about protein or gene evolution (i.e., investigating gene phylogeny, rather than
organismal phylogeny). This can aid our understanding of the function of proteins and
genes.
Studies of protein and gene evolution involve the comparison of homologs—
sequences that have common origins but may or may not have common activity.
Sequences that share an arbitrary, threshold level of similarity determined by alignment
of matching bases are termed homologous. They are inherited from a common ancestor
that possessed similar structure, although the structure of the ancestor may be difficult to
determine because it has been modified through descent.
Homologs are most commonly either orthologs, paralogs, or xenologs.

Figure: 3; the evolution of ortholog and paralog.

• Orthologs are homologs produced by speciation. They represent genes derived
from a common ancestor that diverged due to divergence of the organisms
they are associated with. They tend to have similar function.
• Paralogs are homologs produced by gene duplication. They represent genes
derived from a common ancestral gene that duplicated within an organism and then
subsequently diverged. They tend to have different functions.
• Xenologs are homologs resulting from horizontal gene transfer between two
organisms. The determination of whether a gene of interest was recently transferred into
the current host by horizontal gene transfer is often difficult. Occasionally, the %(G + C)
content may be so vastly different from the average gene in the current host that a
conclusion of external origin is nearly inescapable; however, it is often unclear whether a
gene has horizontal origins. The function of xenologs can vary depending on how
significant the change in context was for the horizontally moving gene; however, in
general, the function tends to be similar.
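The %(G + C) comparison mentioned for xenologs can be sketched in a few lines. This is a deliberately crude illustration: the function names and the toy sequences are hypothetical, and a large deviation is only a hint of external origin, not proof of horizontal transfer.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def gc_deviation(gene, host_genes):
    """Difference between a gene's %(G + C) and the mean over the host's genes."""
    host_values = [gc_percent(g) for g in host_genes]
    host_mean = sum(host_values) / len(host_values)
    return gc_percent(gene) - host_mean


# Toy data: an AT-rich "host" and a GC-rich candidate gene.
host_genes = ["ATATTAAT", "ATGCATAT", "TTAAGCAT"]
candidate = "GGCCGCGA"
deviation = gc_deviation(candidate, host_genes)
```

In practice the comparison would be made against the distribution of values across many host genes rather than the mean alone, since genes within one genome also vary in composition.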

Genome Complexity and Phylogenetic Analysis:


When performing a phylogenetic analysis, it is important to keep in mind that the
genomes of most organisms have a complex origin. Some parts of the genome are
passed on by vertical descent through the normal reproductive cycle. Other parts may
have arisen by horizontal transfer of genetic material between species through a virus,
DNA transformation, symbiosis, or some other horizontal transfer mechanism.
Accordingly, when a particular gene is being subjected to phylogenetic analysis, the
evolutionary history of that gene may not coincide with the evolutionary history of
another.
One of the most significant uses of phylogenetic analysis of sequences is to make
predictions concerning the tree of life. For this purpose, a gene should be selected that
is universally present in all organisms and easily recognizable by the conservation of
sequence in many species. At the same time, there should be enough sequence
variation to determine which groups of organisms share the same phylogenetic origin.
Ideally, the gene should also not be under selection, meaning that as variation occurs in
populations of organisms, certain sequences are not favored with a loss of the more
primitive variation.
Two molecules of this type that carry a great deal of evolutionary history in inter-species
sequence variations are small-subunit rRNA and mitochondrial sequences. A large
number of rRNA sequences from a variety of organisms were aligned, and phylogenetic
predictions were then made using the distance method described below (Woese 1987).
On the basis of rRNA sequence signatures, or regions within the molecule that are
conserved in one group of organisms but different in another, Woese (1987) predicted
that early life diverged into three main kingdoms (Archaea, Bacteria, and Eukarya), a
view that has been challenged (Mayr 1998). Evidence for the presence of additional
organisms in these groups has since been found by PCR amplification of environmental
samples of RNA (Barns et al. 1996). A more detailed analysis was used to find
relationships among individual species within each group. The types of relationships
found among the prokaryotic organisms are illustrated in Figure 4. The use of
mitochondrial sequences for analysis of primate evolution is given below in the
description of the parsimony method of phylogenetic analysis.


Figure 4. Rooted tree of life showing principal relationships among prokaryotic domains
Bacteria and Archaea (Woese 1987; Barns et al. 1996; Brown and Doolittle 1997).
Branch lengths are approximate only. Species that have been sequenced or are being
sequenced are shown.


Although these studies of rRNA sequences suggest a quite clear-cut model for
the evolution of life, phylogenetic analysis of other genes and gene families has revealed
that the situation is probably more complex and that a more appropriate model might be
the one shown in Figure 5. There are now many examples of horizontal or lateral
transfer of genes between species that introduce new genes and sequences into an
organism (Brown and Doolittle 1997; Doolittle 1999). These types of transfers are
inferred from the finding that the phylogenetic histories of different genes in an organism,
such as genes for metabolic functions, are not the same or that codon use in different
genes varies. Another type of phylogenetic analysis is based on the number of genes
shared between genomes and produces a tree that is similar to the rRNA tree (Snel et
al. 1999).

Figure 5: The reticulated or net-like form of the tree of life. Analysis of rRNA sequences
originally suggested three main branches in the tree of life, Archaea, Bacteria, and
Eukarya. Subsequent phylogenetic analysis of genes for some metabolic enzymes is not
congruent with the rRNA tree. Hence, for these metabolic genes, the tree has a
reticulated form due to horizontal transfer of these genes between species. (Martin
1999)


To track the evolutionary history of genes, more attention has also been paid to
the methodology of phylogenetic analysis and to the inherent errors in many of the
assumptions (Doolittle 1999). Problems associated with variations between rates of
change in different sites and of analyzing more distantly related sequences are
discussed below. Moreover, there is evidence that genomes undergo extensive
rearrangements, placing sequences of different evolutionary origin next to each other
and even causing rearrangements within protein-encoding genes (Henikoff et al. 1997).
The different regions of independent evolutionary origin in a sequence therefore
need to be identified. Proteins are modular with functional domains, sometimes repeated
within a protein and sometimes shared within a protein family. These regions are
identified by their sharing of significant sequence similarity. The remainder of the aligned
regions in the group may have variable levels of similarity. In nucleic acid sequences, a
given sequence pattern may provide a binding site for a regulatory molecule, leading to
promoter function, RNA splicing, or some other function. It may be difficult to decide the
extent of these patterns for phylogenetic analysis.

Figure 6: DNA Sequence Evolution

Another feature of genome evolution that should be considered in phylogenetic
analysis is the occurrence of gene duplication events that create tandem copies of a
gene. These two copies may then evolve along separate pathways leading to different
functions. However, these copies maintain a certain level of similarity and undergo
concerted evolution, a process of acquiring mutations in a coordinated way, probably
through gene conversion or recombination events. Speciation events following gene
duplications will give rise to two independent sets of genes and sequences, one set for
each gene copy. Two genes in the same lineage can have different relationships.
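One way to see how two genes in the same lineage can have different relationships is to label each internal node of a gene tree with the event it represents and then look at the most recent common ancestor of a pair of genes. The sketch below assumes such labels are already known, which in real data they usually are not; the `GeneNode` class, the event labels and the gene names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GeneNode:
    label: str
    event: str = "leaf"  # "speciation" or "duplication" for internal nodes
    children: list = field(default_factory=list)


def leaves(node):
    """Set of gene names found at the tips below `node`."""
    if not node.children:
        return {node.label}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def relationship(node, gene_a, gene_b):
    """Classify two genes by the event at their most recent common ancestor."""
    tips = leaves(node)
    if gene_a == gene_b or gene_a not in tips or gene_b not in tips:
        raise ValueError("need two different genes that are both in the tree")
    for child in node.children:
        below = leaves(child)
        if gene_a in below and gene_b in below:
            return relationship(child, gene_a, gene_b)
    return {"speciation": "orthologs", "duplication": "paralogs"}[node.event]


# A duplication produces copies 1 and 2; a later speciation splits each copy.
copy_1 = GeneNode("copy1", "speciation", [GeneNode("speciesX_1"), GeneNode("speciesY_1")])
copy_2 = GeneNode("copy2", "speciation", [GeneNode("speciesX_2"), GeneNode("speciesY_2")])
gene_tree = GeneNode("ancestral gene", "duplication", [copy_1, copy_2])
```

In this toy tree the copy-1 genes of species X and Y are orthologs, whereas either of them compared with any copy-2 gene is a paralog, even when both genes come from the same species.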


Phylogenetic Data Analysis:


A straightforward phylogenetic analysis consists of three major steps:
1. Alignment (both building the data model and extracting a phylogenetic
dataset)
2. Tree construction
3. Tree evaluation
Each step is critical for the analysis and should be handled accordingly. For example,
trees are only as good as the alignment they are based on. When performing a
phylogenetic analysis, it is often insightful to build trees based on different modifications of
the alignment to see how the proposed alignment influences the resulting tree.

Relationship of Phylogenetic Analysis to Sequence Alignment:
When the sequences of two nucleic acid or protein molecules found in two
different organisms are similar, they are likely to have been derived from a common
ancestor sequence. Sequence alignment methods are used to determine sequence
similarity, multiple sequence alignment methods must be applied to a set of
related sequences before a phylogenetic analysis can be performed, and database
search methods locate sequences that are similar to a
query sequence. A sequence alignment reveals which positions in the sequences were
conserved and which diverged from a common ancestor sequence. When one is quite
certain that two sequences share an evolutionary relationship, the sequences are
referred to as being homologous.
The commonest method of multiple sequence alignment first aligns the most
closely related pair of sequences and then sequentially adds more distantly related
sequences or sets of sequences to this initial alignment. The alignment so obtained is
influenced by the most alike sequences in the group and thus may not represent a
reliable history of the evolutionary changes that have occurred. Other methods of
multiple sequence alignment attempt to circumvent the influence of alike sequences.
Once a multiple sequence alignment has been obtained, each column is assumed to
correspond to an individual site that has been evolving according to the observed
sequence variation in the column. Most methods of phylogenetic analysis assume that
each position in the protein or nucleic acid sequence changes independently of the
others (analysis of RNA sequence evolution is an exception).
As indicated above, the analysis of sequences that are strongly similar along
their entire lengths is quite straightforward. However, to align most sequences requires
the positioning of gaps in the alignment. Gaps represent an insertion or deletion of one
or more sequence characters during evolution. Proteins that align well are likely to have
the same three-dimensional structure. In general, sequences that lie in the core structure
of such proteins are not subject to insertions or deletions because any amino acid
substitutions must fit into the packed hydrophobic environment of the core. Gaps should
therefore be rare in regions of multiple sequence alignments that represent these core
sequences. In contrast, more variation, including insertions and deletions, may be found
in the loop regions on the outside of the three-dimensional structure because these
regions do not influence the core structure as much. Loop regions interact with the
environment of small molecules, membranes, and other proteins.


Figure 7: Outline of multiple sequence alignment (MSA) [Mount]

Gaps in alignments can be thought of as representing mutational changes in
sequences, including insertions, deletions, or rearrangements of genetic material. The
expectation that a gap of virtually any length can occur as a single event introduces the
problem of judging how many individual changes have occurred and in what order. Gaps
are treated in various ways by phylogenetic programs, but no clear-cut model as to how
they should be treated has been devised. Many methods ignore gaps or focus on
regions in an alignment that do not have any gaps. Nevertheless, gaps can be useful as
phylogenetic markers in some situations.
Another approach for handling gaps is to avoid analysis of individual sites in the
sequence alignment and instead to use sequence similarity scores as a basis for
phylogenetic analysis. Rather than trying to decide what has happened at each
sequence position in an alignment, a similarity score based on a scoring matrix with
penalties for gaps is often used. These scores may be converted to distance scores that
are suitable for phylogenetic analysis (Feng and Doolittle 1996) by distance methods.


Tree-Building Methods:
Tree-building methods implemented in available software are discussed in detail
in the literature (Saitou, 1996; Swofford et al., 1996; Li, 1997) and described on the
Internet. This section briefly describes some of the most popular methods. Tree building
methods can be sorted into distance-based vs. character-based methods. Much of the
discussion in molecular phylogenetics dwells on the utility of distance and character-
based methods (e.g., Saitou, 1996; Li, 1997). Distance methods compute pairwise
distances according to some measure and then discard the actual data, using only the
fixed distances to derive trees. Character-based methods derive trees that optimize the
distribution of the actual data patterns for each character. Pairwise distances are,
therefore, not fixed, as they are determined by the tree topology. The most commonly
applied distance-based methods include neighbor-joining and the Fitch-Margoliash
method, and the most common character-based methods include maximum parsimony
and maximum likelihood.

Outline of Tree-Building Method:

Figure 8: basic scheme of phylogenetic analysis methods [Mount].

1. The sequences chosen can be either DNA or protein sequence: Different programs and
program options are used for each type. RNA sequences are analyzed by covariation methods
and by analyzing changes in secondary structure. The selected sequences should align with each
other along their entire lengths, or else each should have a common set of patterns or domains
that provides a strong indication of evolutionary relatedness.
2. The alignment of the sequence pairs should not have a large number of gaps that are
obviously necessary to align identical or related characters. A phylogenetic analysis should only
be performed on parts of sequences that can be reasonably aligned. In general, phylogenetic
methods analyze conserved regions that are represented in all the sequences. The more similar
the sequences are to each other, the better. The simplest evolutionary models assume that the
variation in each column of the multiple sequence alignment represents single-step changes and
that no reversals (A → T → A) have occurred. As the observed variation increases, more
multiple-step changes (A → T → G) and reversions are likely to be present. Corrections may be
applied for such variation, thereby increasing the observed amount of change to a more
reasonable value. These corrections assume a uniform rate of change at all sequence positions
over time. Gaps in the multiple sequence alignment are usually not scored because there is no
suitable model for the evolutionary mechanisms that produce them.
3. This question is designed to select sequences suitable for maximum parsimony analysis. Other
methods may also be used with these same sequences. For parsimony analysis, the best results
are obtained when the amount of variation among all pairs of sequences is similar (no very
different sequences are present) and when the amount of variation is small. Some columns in the
multiple sequence alignment will have the same residue in all sequences; other columns will
include both conserved and nonconserved residues. There should be a clear-cut majority of
certain residues in some columns of the alignment but also some variation. These more common
residues are taken to represent an earlier group of sequences from which others were derived. If
there is too much variation, there will be too many possible ancestral relationships. Because the
maximum parsimony method has to attempt to fit all possible trees to the data, the method is not
suitable for more than 11 or 12 sequences because there are too many trees to test. More than
one tree may be found to be equally parsimonious. A consensus tree representing the conserved
features of the different trees may then be produced.
4. The purpose of this question is to select sequences for phylogenetic analysis by distance
methods. Distance methods are able to predict an evolutionary tree when variation among the
sequences is present (some sequences are more alike than others) and when the amount of
variation is intermediate. The number of changed positions in an alignment between two
sequences divided by the total number of matched positions is the distance between the
sequences. As distances increase, corrections are necessary for deviations from single-step
changes between sequences. Of course, as distances increase, the uncertainty of alignments
also increases, and a reassessment of the suitability of the multiple sequence alignment method
may be necessary. Sequences with this type of variation may also be suitable for phylogenetic
analysis by maximum likelihood methods. Distance methods may be used with a large number of
sequences. The program CLUSTALW produces a distance-based tree at the same time as a
multiple sequence alignment (Higgins et al. 1996).
5. Maximum likelihood methods may be used for any set of related sequences, but they are
particularly useful when the sequences are more variable. These methods are computationally
intense, and computational complexity increases with the number of sequences since the
probability of every possible tree must be calculated as described in the text. An advantage of
these methods is that they provide evolutionary models to account for the variation in the
sequences.
6. The data in the multiple sequence alignment columns is resampled to test how well the
branches on the evolutionary tree are supported (boot-strapping).
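
The resampling in point 6 can be sketched as follows. This is a minimal illustration only, assuming a toy four-sequence alignment and a hypothetical helper name (`bootstrap_alignment`); real bootstrap analyses rebuild a tree from each of many replicates and count how often each branch recurs.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_sites = len(alignment[0])
    # Draw as many column indices as there are sites, with replacement.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Use the same sampled columns for every sequence so columns stay intact.
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Toy alignment (illustrative data, one string per taxon).
alignment = ["AAGGGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_alignment(alignment, rng)
for seq in replicate:
    print(seq)
```

Each replicate has the same dimensions as the original alignment, but some columns appear several times and others not at all.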

Character-Based Methods:
The character-based methods have little in common with each other, besides the
use of the character data at all steps in the analysis. This allows the assessment of the
reliability of each base position in an alignment on the basis of all other base positions.


Maximum Parsimony Method (MP):


This method predicts the evolutionary tree (or a tree) that minimizes the number
of steps required to generate the observed variation in the sequences. For this reason,
the method is also sometimes referred to as the minimum evolution method (not to be
confused with the distance-based minimum evolution (ME) method described later). A multiple
sequence alignment is required to predict which sequence positions are likely to
correspond. These positions will appear in vertical columns in the multiple sequence
alignment. For each aligned position, phylogenetic trees that require the smallest
number of evolutionary changes to produce the observed sequence changes are
identified. This analysis is continued for every position in the sequence alignment.
Finally, those trees that produce the smallest number of changes overall for all sequence
positions are identified. This method is used for sequences that are quite similar and for
small numbers of sequences, for which it is best suited. The algorithm followed is not
particularly complicated, but it is guaranteed to find the best tree, because all possible
trees relating a group of sequences are examined. For this reason, the method is quite
time-consuming and is not useful for data that include a large number of sequences or
sequences with a large amount of variation. One or more unrooted trees are predicted
and other assumptions must be made to root the predicted tree.
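
To see why an exhaustive search becomes impractical, the number of candidate trees can be counted. The sketch below assumes strictly bifurcating unrooted trees, for which the count is (2n-5)!! for n taxa; the function name is illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa >= 3 taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # A tree with k-1 taxa has 2k-5 branches, and the k-th taxon
    # can be attached to any one of them.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count

for n in (4, 8, 12):
    print(n, count_unrooted_trees(n))
```

For 4, 8, and 12 taxa this gives 3, 10395, and 654729075 trees respectively, which is why exhaustive parsimony searches are restricted to small data sets.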
Maximum parsimony is an optimization criterion that adheres to the principle that
the best explanation of the data is the simplest, which in turn is the one requiring the
fewest ad hoc assumptions. In practical terms, the MP tree is the shortest—the one with
the fewest changes—which, by definition, is also the one with the fewest parallel
changes. There are several variants of MP that differ with regard to the permitted
directionality of character state change (Swofford et al., 1996).
To accommodate substitution bias, MP is amenable to weighting; for example,
the transformation of a transversion can be weighted relative to a transition. The easiest
way to do this is to create a weighting step matrix in which the weights are the reciprocal
of the rates estimated using ML (see the Maximum Likelihood section). However, step-matrix weighting
can greatly slow MP computation.
The MP method performs poorly when there is substantial among-site rate heterogeneity
(Huelsenbeck, 1995). There are few good fixes for this problem. One approach is to
modify the data set to include only sites that exhibit little or no heterogeneity as
determined by likelihood estimation. Another approach is to recursively reweight
positions according to their propensity to change as observed in preliminary trees. This
‘‘successive approximations’’ approach is automatically facilitated in PAUP, but it is
prone to error to the degree that the preliminary trees are incorrect. PAUP offers a
number of options and parameter settings for a parsimony analysis in the Macintosh
environment. Programs for maximum parsimony analysis are also available in the PHYLIP
package (Felsenstein 1996).
MP analyses tend to yield numerous (and sometimes many thousands of) trees
that have the same score. Because each is held to be as optimal as any other, only
groupings present in the strict consensus of all trees are considered to be supported by
the data. The reason that distance and ML tree methods tend to arrive at a single best
tree is that their calculations involve division and decimals, whereas MP merely counts
discrete steps. For a given data set, a strict consensus of all ME or ML trees that are not
significantly worse than optimal probably would yield resolution more or less comparable
to the MP consensus. Unfortunately, whereas MP users conventionally present strict
consensus (and sometimes consensus of trees one or two steps worse), ME and ML
users typically do not.
Simulation studies have shown that MP performs no better than ME and worse
than ML when the amount of sequence evolution since lineages diverged is much
greater than the amount of divergence that occurred between lineage splits (i.e., in a
tree with very long terminal branches and short internal internodes) (Huelsenbeck,
1995). This condition produces ‘‘long branch attraction’’—the long branches become
artificially connected because the number of nonhomologous similarities the sequences
have accumulated exceeds the number of homologous similarities they have retained
with their true closest relatives (Swofford et al., 1996). Character weighting improves the
performance of MP under these conditions (Huelsenbeck, 1995).

Example: Maximum Parsimony Analysis of Sequences:

Taxa   Sequence position (site) and character

       1  2  3  4  5  6  7  8  9
1      A  A  G  G  G  T  G  C  A
2      A  G  C  C  G  T  G  C  G
3      A  G  A  T  A  T  C  C  A
4      A  G  A  G  A  T  C  C  G

Table 1: Example of phylogenetic analysis to find the correct unrooted tree from four
aligned sequences by the maximum parsimony method (Li and Graur 1991)

Rules for analysis by maximum parsimony in this example are:


1. In the analysis, all of the possible unrooted trees (three trees for four
sequences) are considered. The sequence variations at each site in the alignment are
placed at the tips of the trees, and the tree that requires the smallest number of changes
to produce this variation is determined. This analysis is repeated for each informative
site, and the tree (or trees) that supports the smallest number of changes overall is
found. The length of the tree, defined as the sum of the number of steps in each branch
of the tree, will be a minimum.
2. Some sites are informative, i.e., they favor one tree over another (site 5 is
informative but sites 1, 6, and 8 are not).
3. To be informative, a site must have at least two different sequence characters,
each present in at least two taxa (sites 1, 2, 3, 4, 6, and 8 are not informative;
sites 5, 7, and 9 are informative).
4. Only the informative sites need to be analyzed.
The three possible trees are shown in Figure 9. The optimal tree is obtained by
adding the number of changes at each informative site for each tree, and picking the tree
requiring the least number of changes. A scoring matrix may be used instead of scoring
a change as 1. Tree 1 is the correct one and the tree length will be 4 (one change at
each of positions 5 and 7 and two changes at position 9).
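
The counting in this example can be reproduced with a short sketch. It uses the sequences of Table 1 and the Fitch set-intersection rule to count changes; the assignment of the labels "Tree 2" and "Tree 3" to particular groupings is an assumption, since only Tree 1 is identified in the text.

```python
SEQS = {1: "AAGGGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

# The three unrooted trees for four taxa, written as ((a,b),(c,d)).
TREES = {
    "Tree 1": ((1, 2), (3, 4)),
    "Tree 2": ((1, 3), (2, 4)),  # label assumed
    "Tree 3": ((1, 4), (2, 3)),  # label assumed
}

def fitch_pair(a, b):
    """Fitch step: state set and cost for the parent of two child state sets."""
    common = a & b
    return (common, 0) if common else (a | b, 1)

def site_changes(tree, site):
    """Minimum number of changes at one site (0-based index) on a tree."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_pair({SEQS[a][site]}, {SEQS[b][site]})
    right, cost_right = fitch_pair({SEQS[c][site]}, {SEQS[d][site]})
    _, cost_mid = fitch_pair(left, right)
    return cost_left + cost_right + cost_mid

def is_informative(site):
    """At least two different characters, each present in at least two taxa."""
    counts = {}
    for seq in SEQS.values():
        counts[seq[site]] = counts.get(seq[site], 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2

informative = [i for i in range(9) if is_informative(i)]
lengths = {
    name: sum(site_changes(tree, i) for i in informative)
    for name, tree in TREES.items()
}
print([i + 1 for i in informative])  # informative sites, 1-based
print(lengths)
```

The informative sites come out as 5, 7, and 9, and the tree lengths over those sites are 4, 5, and 6, so the grouping ((1,2),(3,4)) is the most parsimonious.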


Figure 9: Example of phylogenetic analysis using the maximum parsimony method. (Li
and Graur 1991). This figure shows an example of phylogenetic analysis by maximum
parsimony. This method finds the tree that changes any sequence into all of the others
by the least number of steps.

Features:
• Multiple sequence alignment needed
• Used for rather similar sequences to be analyzed in small numbers
• Time-consuming and computationally costly
• Widely used software in the field that implements maximum parsimony and other
  methods is PHYLIP:
  http://evolution.genetics.washington.edu/phylip.html
• PAUP* is also used for this purpose:
  http://paup.csit.fsu.edu/index.html

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest
number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements)
to explain the sequences.

Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral,
  homoplasy).
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”, i.e., under the
  long-branch attraction conditions described above.


The Maximum Likelihood Method (ML):


This method uses probability calculations to find a tree that best accounts for the
variation in a set of sequences. The method is similar to the maximum parsimony
method in that the analysis is performed on each column of a multiple sequence
alignment. All possible trees are considered. Hence, the method is only feasible for a
small number of sequences. For each tree, the number of sequence changes or
mutations that may have occurred to give the sequence variation is considered. Because
the rate of appearance of new mutations is very small, the more mutations needed to fit
a tree to the data, the less likely that tree (Felsenstein 1981). The maximum likelihood
method resembles the maximum parsimony method in that trees with the least number
of changes will be the most likely. However, the maximum likelihood method presents an
additional opportunity to evaluate trees with variations in mutation rates in different
lineages, and to use explicit evolutionary models such as the Jukes-Cantor and Kimura
models. Thus, the method can be used to explore relationships among more diverse
sequences, conditions that are not well handled by maximum parsimony methods.
The main disadvantage of the method is its computational cost. However, with faster
computers, the maximum likelihood method is seeing wider use
and is being used for more complex models of evolution (Schadt et al. 1998). Maximum
likelihood has also been used for an analysis of mutations in overlapping reading frames
in viruses (Hein and Støvlbæk 1996). PAUP version 4 can be used to perform a
maximum likelihood analysis on DNA sequences. The method has also been applied for
changes from one amino acid to another in protein sequences.
In practice, ML is derived for each base position in an alignment. The likelihood is
calculated in terms of the probability that the pattern of variation at a site would be
produced by a particular substitution process, given a particular tree and the overall
observed base frequencies. The likelihood becomes the sum of the probabilities of each
possible reconstruction of substitutions under a particular substitution process. The
likelihoods for all the sites are multiplied to give an overall ‘‘likelihood of the tree’’ (i.e.,
the probability of the data given the tree and the substitution process). As one can
imagine, for one particular tree, the likelihood of the data is low at some sites and high at
others. For a ‘‘good’’ tree, many sites will have higher likelihood, so the product of
likelihoods is high. For a ‘‘poor’’ tree, the reverse will be true.
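
The calculation described above can be sketched for four taxa. This is a simplified illustration under stated assumptions: the Jukes-Cantor model (equal base frequencies of 0.25 rather than observed frequencies), all branch lengths fixed at an arbitrary 0.1 substitutions per site instead of being optimized, and a brute-force sum over the two internal nodes. The example sequences and all names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"
SEQS = {1: "AAGGGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

def jc69_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tips, t_tip, t_internal):
    """Likelihood of one column on the tree ((1,2),(3,4)); tips = (s1, s2, s3, s4)."""
    s1, s2, s3, s4 = tips
    total = 0.0
    # Sum over every possible assignment of bases to the two internal nodes.
    for x, y in product(BASES, repeat=2):
        total += (0.25  # prior probability of base x at the first internal node
                  * jc69_prob(x, s1, t_tip) * jc69_prob(x, s2, t_tip)
                  * jc69_prob(x, y, t_internal)
                  * jc69_prob(y, s3, t_tip) * jc69_prob(y, s4, t_tip))
    return total

def log_likelihood(seqs, order, t_tip=0.1, t_internal=0.1):
    """Log-likelihood of the tree ((a,b),(c,d)), where order = (a, b, c, d)."""
    a, b, c, d = (seqs[i] for i in order)
    # Multiplying site likelihoods is the same as summing their logarithms.
    return sum(math.log(site_likelihood(col, t_tip, t_internal))
               for col in zip(a, b, c, d))

ORDERS = {
    "((1,2),(3,4))": (1, 2, 3, 4),
    "((1,3),(2,4))": (1, 3, 2, 4),
    "((1,4),(2,3))": (1, 4, 2, 3),
}
scores = {name: log_likelihood(SEQS, order) for name, order in ORDERS.items()}
best = max(scores, key=scores.get)
print(scores)
print("highest likelihood:", best)
```

Working with log-likelihoods avoids numerical underflow when many small site likelihoods are multiplied. A full ML analysis would also optimize the branch lengths and model parameters for each tree.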
The substitution model should be optimized to fit the observed data. For
example, if there is a transition bias, evident by an inordinate number of sites that
include only purines or pyrimidines, the likelihood of the data under a model that
assumes no bias will never be as good as one that does. Likewise, if a substantial
proportion of the sites are occupied by a single base and another substantial proportion
have equal base frequencies, the likelihood of the data under a model that assumes
that all sites evolve equally will be less than that of a model that allows rate
heterogeneity. Modifying the substitution parameters, however, modifies the likelihood of
the data associated with particular trees. Thus, the tree yielding the highest likelihood
under one substitution model might yield much lower likelihood under another.
Because ML uses great amounts of computational time, it is usually impractical
to perform a complete search that simultaneously optimizes the substitution model and
the tree for a given data set. An economical, heuristic approach is recommended
(Adachi and Hasegawa, 1996; Swofford et al., 1996). Perhaps the best time saver in this
regard is preliminary ML estimation of the substitution model (as can be performed using
PAUP). This procedure can be applied iteratively: searching for better ML trees, then
re-estimating the parameters, and then searching for better trees again.
As algorithms, computers, and phylogenetic understanding have improved, the
ML criterion has become more popular for molecular phylogenetic analysis. In simulation
studies, ML has consistently outperformed ME and MP when the data analysis proceeds
according to the same model that generates the data (Huelsenbeck, 1995). ML will
always be the most computationally intensive method of all, however, so there will
always be situations in which it is not practical.

Features:
• Uses probability calculations to find a tree that best accounts for the observed
  sequence variations.
• All possible trees are considered (time-consuming)
• Few sequences can be analyzed
• It is possible to evaluate trees with variations in mutation rates in different lineages
• Uses explicit evolutionary models (e.g., Jukes-Cantor, Kimura)

Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the
probability that a proposed model of the evolutionary process and the proposed
unrooted tree would give rise to the observed data. The tree found to have the highest
ML value is considered to be the preferred tree.

Advantages:
• Are inherently statistical and evolutionary model-based.
• Usually the most ‘consistent’ of the methods available.
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
• Can help account for branch-length effects in unbalanced trees.
• Can be applied to nucleotide or amino acid sequences, and other types of data.

Disadvantages:
• Are not as simple and intuitive as many other methods.
• Are computationally very intense (limits number of taxa and length of sequence).
• Like parsimony, can be fooled by high levels of homoplasy.
• Violations of the assumed model can lead to incorrect trees.

Distance-Based Methods:
The distance method employs the number of changes between each pair in a
group of sequences to produce a phylogenetic tree of the group. The sequence pairs
that have the smallest number of sequence changes between them are termed
“neighbors.” On a tree, these sequences share a node or common ancestor position and
are each joined to that node by a branch. The goal of distance methods is to identify a
tree that positions the neighbors correctly and that also has branch lengths which
reproduce the original data as closely as possible. Finding the closest neighbors among
a group of sequences by the distance method is often the first step in producing a
multiple sequence alignment.
The distance method was pioneered by Feng and Doolittle, and a collection of
programs by these authors will produce both an alignment and tree of a set of protein
sequences (Feng and Doolittle 1996). The program CLUSTALW uses the
neighbor-joining distance method as a guide to multiple sequence alignments. PAUP version 4
has options for performing a phylogenetic analysis by distance methods. Programs of
the PHYLIP package that perform a distance analysis automatically read in
sequences in the PHYLIP infile format and automatically produce a file called outfile with
a distance table.
Distance-based methods use the amount of dissimilarity (the distance) between
two aligned sequences to derive trees. A distance method would reconstruct the true
tree if all genetic divergence events were accurately recorded in the sequence (Swofford
et al., 1996). However, divergence encounters an upper limit as sequences become
mutationally saturated. After one sequence of a diverging pair has mutated at a particular
site, subsequent mutations in either sequence cannot render the sites any more
‘‘different.’’ In fact, subsequent mutations can make them again equal (for example, if a
valine mutates to an isoleucine, which mutates back to a valine). Therefore, most
distance-based methods correct for such ‘‘unseen’’ substitutions. In practice, application
of the rate matrix effectively presumes that some proportion of observed pairwise base
identities actually represents multiple mutations and that this proportion increases with
increasing overall sequence divergence. Some programs implement, at least optionally,
calculation of uncorrected distances, whereas, for example, the MEGA program (Kumar
et al., 1994) implements only uncorrected distances for codon and amino acid data.
Unless overall divergences are very low, the latter approach is virtually guaranteed to
give inaccurate results.
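
The difference between an uncorrected and a corrected distance can be illustrated with a minimal sketch. It assumes nucleotide data and the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), as one example of a correction for unseen substitutions; the function names and the two example sequences are illustrative.

```python
import math

def p_distance(seq_a, seq_b):
    """Uncorrected distance: differing positions / compared positions (gaps skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differences."""
    if p >= 0.75:
        # At p >= 3/4 the sequences are saturated and the correction is undefined.
        raise ValueError("sequences are saturated; distance cannot be estimated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("AAGGGTGCA", "AGCCGTGCG")
d = jukes_cantor_distance(p)
print(round(p, 3), round(d, 3))
```

The corrected distance is always at least as large as the uncorrected one, and the gap between them widens as divergence increases, reflecting the growing share of multiple substitutions at the same site.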
Pairwise distance is calculated using maximum-likelihood estimators of
substitution rates. The most popular distance tree-building programs have a limited
number of substitution models, but PAUP 4.0 implements a number of models, including
the actual model estimated from the data using maximum likelihood, as well as the
logdet distance method.
Distance methods are much less computationally intensive than maximum
likelihood but can employ the same models of sequence evolution. This is their biggest
advantage. The disadvantage is that the actual character data are discarded. The most
commonly applied distance-based methods are the unweighted pair group method with
arithmetic mean (UPGMA), neighbor joining (NJ), and methods that optimize the
additivity of a distance tree, including the minimum evolution (ME) method. Several
methods are available in more than one phylogenetics software package but not all
implementations allow the same parameter specifications and/or tree optimization
features (e.g., branch swapping).

Unweighted Pair Group Method with Arithmetic Mean (UPGMA):


UPGMA is a clustering or phenetic algorithm—it joins tree branches based on the
criterion of greatest similarity among pairs and averages of joined pairs. It is not strictly
an evolutionary distance method (Li, 1997). UPGMA is expected to generate an
accurate topology with true branch lengths only when the divergence is according to a
molecular clock (ultrametric; Swofford et al., 1996) or approximately equal to raw
sequence dissimilarity. As mentioned earlier, these conditions are rarely met in practice.
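
The clustering procedure can be sketched as follows. This is a minimal illustration, assuming a hypothetical distance table for four taxa that happens to be ultrametric; the function name and data are not taken from any program mentioned in the text.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA; return a list of (merged cluster, node height)."""
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0
        new = (a, b)
        # Distance to the new cluster is the size-weighted mean, so every
        # original leaf contributes equally (the "unweighted" in UPGMA).
        for c in sizes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
        merges.append((new, height))
    return merges

# Hypothetical ultrametric distances (consistent with a molecular clock).
distances = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
merges = upgma(["A", "B", "C", "D"], distances)
for cluster, height in merges:
    print(cluster, height)
```

With these clock-like distances the procedure recovers the nesting of A and B, then C, then D. When rates differ between lineages, the same procedure can join the wrong pair first.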

Neighbor Joining (NJ):


The neighbor-joining algorithm is commonly applied with distance tree building,
regardless of the optimization criterion. The fully resolved tree is ‘‘decomposed’’ from a
fully unresolved ‘‘star’’ tree by successively inserting branches between a pair of closest
(actually, most isolated) neighbors and the remaining terminals in the tree (Fig. 10). The
closest neighbor pair is then consolidated, effectively reforming a star tree, and the
process is repeated. The method is comparatively rapid.


Figure 10. Star decomposition. This is how tree-building algorithms such as neighbor
joining work. The most similar terminals are joined, and a branch is inserted between
them and the remainder of the star. Subsequently, the new branch is consolidated so
that its value is a mean of the two original values, yielding a star tree with n-1 terminals.
The process is repeated until only one terminal remains.
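
The distinction between the "closest" and the "most isolated" pair can be made concrete. The sketch below assumes the commonly used neighbor-joining selection criterion Q(i,j) = (n-2)d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other taxa; this formula is not spelled out in the text, and the distance table is hypothetical (it is additive on a tree in which A and B, and C and D, are true neighbors with unequal branch lengths).

```python
from itertools import combinations

def closest_pair(names, d):
    """Pair with the smallest raw distance."""
    return min(combinations(names, 2), key=lambda pair: d[frozenset(pair)])

def nj_pick_pair(names, d):
    """Pair chosen by neighbor joining: the one that minimizes Q(i, j)."""
    n = len(names)
    # r[i] is the total distance from taxon i to every other taxon.
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    return min(combinations(names, 2), key=q)

names = ["A", "B", "C", "D"]
# Hypothetical distances from a tree with short branches to A and C
# and long branches to B and D.
d = {
    frozenset(("A", "B")): 6.0, frozenset(("A", "C")): 3.0,
    frozenset(("A", "D")): 7.0, frozenset(("B", "C")): 7.0,
    frozenset(("B", "D")): 11.0, frozenset(("C", "D")): 6.0,
}
print("closest pair:", closest_pair(names, d))
print("pair joined by NJ:", nj_pick_pair(names, d))
```

Here A and C have the smallest raw distance, yet the criterion selects A and B, because it discounts each taxon's average distance to the rest of the tree.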

Fitch-Margoliash (FM):
The Fitch-Margoliash (FM) method seeks to maximize the fit of the observed
pairwise distances to a tree by minimizing the squared deviation of all possible observed
distances relative to all possible path lengths on the tree (Felsenstein, 1997). There are
several variations that differ in how the error is weighted. The variance estimates are not
completely independent because errors in all the internal tree branches are counted at
least twice (Rzhetsky and Nei, 1992).
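
The fit criterion can be written as a small function. This sketch assumes the weighting usually associated with Fitch and Margoliash, in which each squared deviation is divided by the square of the observed distance (`power=2`); `power=0` gives the unweighted variant. The four-taxon helper, the branch lengths, and the observed distances are hypothetical.

```python
def fm_score(observed, fitted, power=2):
    """Weighted sum of squared deviations between observed and tree path distances."""
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** power
        for pair in observed
    )

def four_taxon_paths(a, b, c, d, mid):
    """Path lengths on the tree ((A,B),(C,D)) given its five branch lengths."""
    return {
        ("A", "B"): a + b,
        ("C", "D"): c + d,
        ("A", "C"): a + mid + c,
        ("A", "D"): a + mid + d,
        ("B", "C"): b + mid + c,
        ("B", "D"): b + mid + d,
    }

# Hypothetical observed pairwise distances.
observed = {
    ("A", "B"): 6.0, ("C", "D"): 6.0, ("A", "C"): 3.5,
    ("A", "D"): 7.0, ("B", "C"): 7.0, ("B", "D"): 11.0,
}
good = four_taxon_paths(1, 5, 1, 5, 1)  # branch lengths close to the data
poor = four_taxon_paths(3, 3, 3, 3, 1)  # same topology, poorly fitted lengths
print(fm_score(observed, good), fm_score(observed, poor))
```

The tree and branch lengths with the lowest score are preferred. In a real analysis the branch lengths are optimized for each candidate topology before the scores are compared.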

Minimum Evolution (ME):


Minimum evolution seeks to find the shortest tree that is consistent with the path
lengths measured in a manner similar to FM; that is, ME works by minimizing the
squared deviation of observed to tree-based distances (Rzhetsky and Nei, 1992;
Swofford et al., 1996; Felsenstein, 1997). Unlike FM, ME does not use all possible
pairwise distances and all possible associated tree path lengths. Rather, it fixes the
location of internal tree nodes based on the distance to external nodes and then
optimizes the internal branch length according to the minimum measured error between
these ‘‘observed’’ points. It thus purports to eliminate the non-independence of FM
measurements.

Which Distance-Based Tree-Building Procedure Is Best?


ME and FM appear to be the best procedures, and they perform nearly identically
in simulation studies (Huelsenbeck, 1995). ME is becoming more widely implemented in
computer programs, including METREE (Rzhetsky and Nei, 1994) and PAUP. For
protein data, the FM procedure in PHYLIP offers the greatest range of substitution
models but no correction for among-site rate heterogeneity. The MEGA (Kumar et al.,
1994) and METREE packages include a gamma correction for proteins, but only in
conjunction with a raw (‘‘p-distance’’) divergence model (no distance or bias correction),
which is unreliable except for small divergences (Rzhetsky and Nei, 1994). MEGA also
computes separate distances for synonymous and nonsynonymous sites, but this
method is valid only in the absence of substitution or base frequency bias and when
there is no correction for among-site rate heterogeneity. Thus, for most data sets, using
the nucleotide data under a more realistic model might be preferable to MEGA’s
methods.
Simulation studies indicate that UPGMA performs poorly over a broad range of
tree shape space (Huelsenbeck, 1995). The use of this method is not recommended; it
is mentioned here only because its application seems to persist, as evidenced by
UPGMA gene trees appearing in publications (Huelsenbeck, 1995).
NJ is clearly the fastest procedure and generally yields a tree close to the ME
tree (Rzhetsky and Nei, 1992; Li, 1997). However, it yields only one tree. Depending on
the structure of the data, numerous different trees might be as good or significantly
better than the NJ tree (Swofford et al., 1996).

THE DIFFERENCE between DISTANCE, PARSIMONY, AND MAXIMUM LIKELIHOOD Methods:
Distance matrix methods simply count the number of differences between two
sequences. This number is referred to as the evolutionary distance, and its exact size
depends on the evolutionary model used. The actual tree is then computed from the
matrix of distance values by running a clustering algorithm that starts with the most
similar sequences (i.e., those that have the shortest distance between them) or by trying
to minimize the total branch length of the tree. The principle of maximum parsimony
searches for a tree that requires the smallest number of changes to explain the
differences observed among the taxa under study.
A maximum-likelihood approach to phylogenetic inference evaluates the
probability that the chosen evolutionary model has generated the observed data. The
evolutionary model could simply mean that one assumes that changes between all
nucleotides (or amino acids) are equally probable. The program will then assign all
possible nucleotides to the internal nodes of the tree in turn and calculate the probability
that each such sequence would have generated the data (if two sister taxa have the
nucleotide “A,” a reconstruction that assumes derivation from a “C” would be assigned
a low probability compared with a derivation that assumes there already was an “A”).
The probabilities for all possible reconstructions (not just the more probable one) are
summed up to yield the likelihood for one particular site. The likelihood for the tree is the
product of the likelihoods for all alignment positions in the data set.
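The calculation described above can be sketched for the simplest model mentioned, in which all changes are equally probable (the Jukes-Cantor model). The tree encoding, the branch lengths and the function names below are assumptions made for illustration. For each site the code sums over every possible base at the internal nodes, and the site values are then combined as a sum of logarithms, which is the product of site likelihoods on a log scale.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x is replaced by base y along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node, site):
    """Probability of the data below a node, for each base the node may carry.

    A leaf is an aligned sequence string; an internal node is a tuple of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        return {x: 1.0 if node[site] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, site)
        for x in BASES:
            # Sum over every base y the child could carry.
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def site_likelihood(tree, site):
    """Sum over all root bases, each with prior probability 1/4."""
    cond = partial(tree, site)
    return sum(0.25 * cond[x] for x in BASES)

def log_likelihood(tree, n_sites):
    """Log of the product of the site likelihoods."""
    return sum(math.log(site_likelihood(tree, s)) for s in range(n_sites))

# Two sister taxa joined at one internal node (hypothetical data).
tree = (("AAC", 0.1), ("AAT", 0.1))
agree = site_likelihood(tree, 0)      # both taxa carry "A"
disagree = site_likelihood(tree, 2)   # taxa carry "C" and "T"
total = log_likelihood(tree, 3)
```

At the first site the reconstruction with an ancestral "A" contributes most of the sum, while reconstructions with "C", "G" or "T" contribute little, as in the example in the text. A full maximum-likelihood program would also optimise the branch lengths and search over tree topologies, which this sketch does not attempt.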

Pitfalls of phylogenetic analysis:


It is not generally appreciated that molecular sequence analysis is a field in its
infancy. It is an inexact science in which there are few analytical tools that are truly
based on general mathematical and statistical principles. Consequently, many, perhaps
most, phylogenetic trees reconstructed from molecular sequences are incorrect and
frequently conflict with common sense. This is mainly caused by one or more of the
three pitfalls of sequence analysis: (1) incorrect sequence alignments, caused by
inadequate mathematical models and often related specifically to biases created by
progressive alignment algorithms when they are used to align more than three taxa; (2)
the failure to account properly for site-to-site variation (all sites within sequences can
evolve at different rates); and (3) unequal rate effects (the inability of most tree-building
algorithms to produce good phylogenetic trees when genes from different taxa in the tree
evolve at different rates). All three pitfalls can produce the same artefact – long branch
attraction. In these artefactually produced trees, rapidly evolving sequences
(represented by long branches on phylogenetic trees) will be placed with other rapidly
evolving sequences, even if the sequences are only distantly related. In comparison with
most problems in molecular biology, which can be solved by acquiring more data, long
branch attractions are much more complex. If longer sequences are used when long
branch attractions are present, the incorrect solution will be even more strongly
supported. Of the three pitfalls, alignment artefacts are potentially the most serious,
because even if the second and third problems are solved, the misalignments can still
produce incorrect trees. A new algorithm, paralinear (LogDet) distances (Lockhart et al.,
1994; Lake, 1994), provides a simple but rigorous mathematical solution
for the third pitfall. This particular algorithm is now available in some of the phylogenetic
packages. For a discussion of many other useful algorithms that are
available, including maximum parsimony, maximum likelihood and other distance
methods, see Stewart (1993).

Tree Evaluation:
Several procedures are available that evaluate the phylogenetic signal in the
data and the robustness of trees (Swofford et al., 1996; Li, 1997). The most popular of
the former class are tests of data signal versus randomized data (skewness and
permutation tests). The latter class includes tests of tree support from resampling of
observed data (nonparametric bootstrap). The likelihood ratio test provides a means of
evaluating both the substitution model and the tree.
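As an illustration of the likelihood ratio test applied to the substitution model, the following sketch compares two nested models that differ by one free parameter. The log-likelihood values and the function name are hypothetical. The statistic is twice the difference in log-likelihood, and it is compared here with a chi-square distribution with one degree of freedom, which is the usual approximation for nested models. Comparing tree topologies is not a nested comparison and generally needs a different reference distribution.

```python
import math

def lrt_one_extra_parameter(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p_value). The p-value uses the chi-square
    distribution with 1 degree of freedom, whose upper tail probability
    equals erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    if statistic <= 0.0:
        return statistic, 1.0      # the richer model fits no better
    return statistic, math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical log-likelihoods of the same tree under two models.
statistic, p_value = lrt_one_extra_parameter(-1523.4, -1519.1)
```

A small p-value suggests that the extra parameter improves the fit by more than chance alone would explain.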

Reliability of Phylogenetic Predictions:


As discussed earlier, phylogenetic analysis of a set of sequences that aligns very
well is straightforward because the positions that correspond in the sequences can be
readily identified in a multiple sequence alignment of the sequences. The types of
changes in the aligned positions or the numbers of changes in the alignments between
pairs of sequences then provide a basis for a determination of phylogenetic relationships
among the sequences by the above methods of phylogenetic analysis. For sequences
that have diverged considerably, a phylogenetic analysis is more challenging. A
determination of the sequence changes that have occurred is more difficult because the
multiple sequence alignment may not be optimal and because multiple changes may
have occurred in the aligned sequence positions. The choice of a suitable multiple
sequence alignment method depends on the degree of variation among the sequences.
Once a suitable alignment has been found, one may also ask how well the predicted
phylogenetic relationships are supported by the data in the multiple sequence alignment.
In the bootstrap method, the data are resampled by randomly choosing vertical
columns from the aligned sequences to produce, in effect, a new sequence alignment of
the same length. Each column of data may be used more than once and some columns
may not be used at all in the new alignment. Trees are then predicted from many of
these alignments of resampled sequences (Felsenstein 1988). For branches in the
predicted tree topology to be significant, the resampled data sets should frequently (for
example, >70%) predict the same branches. Bootstrap analysis is supported by most of
the commonly used phylogenetic inference software packages and is commonly used to
test tree branch reliability. Another method of testing the reliability of one part of the tree
is to collapse two branches into a common node (Maddison and Maddison 1992). The
tree length is again evaluated and compared to the original length, and any increase is
the decay value. The greater the decay value, the more significant the original branches.
In addition to these methods, there are some additional recommendations that increase
confidence in a phylogenetic prediction.
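The column resampling used in the bootstrap method can be sketched as follows. This is a simplified illustration, not the procedure of any particular package. Instead of building a full tree from each resampled alignment, it only asks whether the same pair of sequences remains the closest pair. The alignment, the function names and the number of replicates are assumptions.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Draw columns with replacement to build an alignment of equal length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

def closest_pair(alignment):
    """Pair of taxa with the fewest differences (ties go to the first pair)."""
    def diffs(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: diffs(*pair))

def bootstrap_support(alignment, n_reps=200, seed=1):
    """Fraction of resampled alignments that recover the original grouping."""
    rng = random.Random(seed)
    target = closest_pair(alignment)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == target
               for _ in range(n_reps))
    return target, hits / n_reps

# Hypothetical toy alignment (illustration only).
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGGAT",
    "D": "ACGAACGGAC",
}
pair, support = bootstrap_support(alignment)
```

In a real analysis each resampled alignment would be passed to the chosen tree-building method, and the support for every branch of the original tree would be recorded in the same way.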
One recommendation is to use at least two of the above methods
(maximum parsimony, distance, or maximum likelihood) for the analysis. If two of these
methods provide the same prediction, confidence in the prediction is much higher.
Another recommendation is to pay careful attention to the evolutionary assumptions and
models that are used for both sequence alignment and tree construction (Li and Graur
1991; Swofford et al. 1996; Li 1997).

Complications from Phylogenetic Analysis:


The above methods provide a further level of sequence analysis by predicting
possible evolutionary relationships among a group of related sequences. The methods
predict a tree that shows possible ancestral relationships among the sequences. A
phylogenetic analysis can be performed on proteins or nucleic acid sequences using any
one of the three methods described above, each of which utilizes a different type of
algorithm. The reliability of the prediction can also be evaluated.
The traditional use of phylogenetic analysis is to discover evolutionary
relationships among species. In such cases, a suitable gene or DNA sequence that
shows just enough, but not too much, variation among a group of organisms is selected
for phylogenetic analysis. For example, analysis of mitochondrial sequences is used to
discover evolutionary relationships among mammals. Two more recent uses of
phylogenetic analysis are to analyze gene families and to trace the evolutionary history
of specific genes. For example, database similarity searches may identify several
proteins in a plant genome that are similar to a yeast query protein. From a phylogenetic
analysis of the protein family, the plant gene most closely related to the yeast gene and
therefore most likely to have the same function can be determined. The prediction can
then be evaluated in the laboratory. Tracking the evolutionary history of individual genes
in a group of species can reveal which genes have remained in a genome for a long time
and which genes have been horizontally transferred between species. Thus,
phylogenetic analysis can also contribute to an understanding of genome evolution.


Reference:
Book reference:
Baxevanis, Andreas D. & Ouellette, B. F. Francis. Bioinformatics: A Practical Guide
to the Analysis of Genes and Proteins (Wiley, 2001).
Brown, T.A. Genomes 2, Wiley-Liss (2002).
Mount, David W. Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor
Laboratory Press.

Journal reference:
Adachi, J., and Hasegawa, M. (1996). MOLPHY Version 2.3. Programs for molecular
phylogenetics based on maximum likelihood (Tokyo: Institute of Statistical Mathematics).
Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal
diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl.
Acad. Sci. 93: 9188–9193.
Brown J.R. and Doolittle W.F. 1997. Archaea and the prokaryote-to-eukaryote transition.
Microbiol. Mol. Biol. Rev. 61: 456–502.
Comeron J.M. and Kreitman M. 1998. The correlation between synonymous and
nonsynonymous substitutions in Drosophila: Mutation, selection or relaxed constraints?
Genetics 150: 767–775.
Darwin C (1859) The Origin of Species by Means of Natural Selection, or the Preservation
of Favoured Races in the Struggle for Life, Penguin Books, London.
Doolittle W.F. 1999. Phylogenetic classification and the universal tree. Science 284:
2124–2128.
Eernisse DJ (1998) A brief guide to phylogenetic software. Trends Genet, 14, 473-475.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood
approach. J. Mol. Evol. 17: 368–376.
———. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu.
Rev. Genet. 22: 521–565.
———. 1989. PHYLIP: Phylogeny inference package (version 3.2). Cladistics 5: 164–
166.
———. 1996. Inferring phylogeny from protein sequences by parsimony, distance and
likelihood methods. Methods Enzymol. 266: 418–427.
Feng D.F. and Doolittle R.F. 1996. Progressive alignment of amino acid sequences and
construction of phylogenetic trees from them. Methods Enzymol. 266: 368–382.
Fitch W.M. 1981. A non-sequential method for constructing trees and hierarchical
classifications. J. Mol. Evol. 18: 30–37.
Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155:
279–284
Hein J. and Støvlbæk J. 1996. Combined DNA and protein alignment. Methods
Enzymol. 266: 402–418.
Henikoff S., Greene E.A., Pietrokovski S., Bork P., Attwood T.K., and Hood L. 1997.
Gene families: The taxonomy of protein paralogs and chimeras. Science 278: 609–614.
Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana, IL.
Hershkovitz, M.A., and Lewis, L.A. (1996). Deep-level diagnostic value of the rDNA ITS
region. Mol. Biol. Evol. 13, 1276–1295.
Higgins D.G., Thompson J.D., and Gibson T.J. 1996. Using CLUSTAL for multiple
sequence alignments. Methods Enzymol. 266: 383–402
Hillis, D. M., Allard, M. W., and Miyamoto, M. M. (1993). Analysis of DNA sequence
data: Phylogenetic inference. Methods Enzymol. 224, 456–487.
Hillis DM (1997) Biology recapitulates phylogeny. Science, 276, 218-219.
Hillis, D. M., and Bull, J. J. (1993). An empirical test of bootstrapping as a method for
assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182–192.
Hillis, D. M., Huelsenbeck, J. P., and Cunningham, C. W. (1994). Application and
accuracy of molecular phylogenies. Science 264, 671–677.
Huelsenbeck, J. P. (1995). Performance of phylogenetic methods in simulation. Syst.
Biol. 44, 17–48.
Huelsenbeck, J. P., Hillis, D. M., and Jones, R. (1996). Parametric bootstrapping in
molecular phylogenetics. In Molecular Zoology: Advances, Strategies, and Protocols, J.
D. Ferraris and S. R. Palumbi, Eds. (New York: Wiley-Liss), p. 19–45.
Kumar, S., Tamura, K., and Nei, M. (1994). MEGA: Molecular Evolutionary Genetics
Analysis software for microcomputers. Comput. Appl. Biosci. 10, 189–191.
Lake, J.A. (1994) Proc. Natl. Acad. Sci. U.S.A. 91, 1455–1459
Li W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous
substitution. J. Mol. Evol. 36: 96–99.
———. 1997. Molecular evolution. Sinauer Associates, Sunderland, Massachusetts.
Li W.-H. and Graur D. 1991. Fundamentals of molecular evolution, pp. 106–111. Sinauer
Associates, Sunderland, Massachusetts.
Li W.-H. and Gu X. 1996. Estimating evolutionary distances between DNA sequences.
Methods Enzymol. 266: 449–459.
Li W.-H., Wu C.I., and Luo C.C. 1985. A new method for estimating synonymous and
nonsynonymous rates of nucleotide substitution considering the relative likelihood of
nucleotide and codon changes. Mol. Biol. Evol. 2: 150–174.
Lockhart, P.J. et al. (1994) Mol. Biol. Evol. 11, 605–612
Maddison W.P. and Maddison D.R. 1992. MacClade: Analysis of phylogeny and
character evolution (version 3). Sinauer Associates, Sunderland, Massachusetts
Martin W. 1999. Mosaic bacterial chromosomes: A challenge en route to a tree of
genomes. Bioessays 21: 99–104.
Mayr E. 1998. Two empires or three? Proc. Natl. Acad. Sci. 95: 9720–9723
McDonald J.H. and Kreitman M. 1991. Adaptive protein evolution at the Adh locus in
Drosophila. Nature 351: 652–654.
Miyamoto M.M. and Cracraft J. 1991. Phylogenetic analysis of DNA sequences. Oxford
University Press, New York.
Nielsen R. and Yang Z. 1998. Likelihood models for detecting positively selected amino
acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.
Nuttall GHF (1904) Blood Immunity and Blood Relationship. Cambridge University
Press, Cambridge.
Rzhetsky, A., and Nei, M. (1992). A simple method for estimating and testing minimum
evolution trees. Mol. Biol. Evol. 9, 945–967.
Rzhetsky, A., and Nei, M. (1994). METREE: A program package for inferring and testing
minimum-evolution trees. Comput. Appl. Biosci. 10, 409–412.
Saitou N. 1996. Reconstruction of gene trees from sequence data. Methods Enzymol.
266: 427–449
Saitou N. and Nei M. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Schadt E.E., Sinsheimer J.S., and Lange K. 1998. Computational advances in maximum
likelihood methods for molecular phylogeny. Genome Res. 8: 222–233.
Snel B., Bork P., and Huynen M.A. 1999. Genome phylogeny based on gene content.
Nat. Genet. 21: 108–110.
Stewart, C-B. (1993) Nature 361, 603–607
Swofford D.L., Olsen G.J., Waddell P.J., and Hillis D.M. 1996. Phylogenetic inference. In
Molecular systematics, 2nd edition (ed. D.M. Hillis et al.), chap. 5, pp. 407–514. Sinauer
Associates, Sunderland, Massachusetts.
Woese C.R. 1987. Bacterial evolution. Microbiol. Rev. 51: 221–271