Professional Documents
Culture Documents
net/publication/236042895
CITATIONS READS
425 1,575
1 author:
Barry G. Hall
Bellingham Research Institute
179 PUBLICATIONS 5,676 CITATIONS
SEE PROFILE
All content following this page was uploaded by Barry G. Hall on 29 October 2015.
Abstract
Phylogenetic analysis is sometimes regarded as being an intimidating, complex process that requires expertise and years
of experience. In fact, it is a fairly straightforward process that can be learned quickly and applied effectively. This
Protocol describes the several steps required to produce a phylogenetic tree from molecular data for novices. In the
example illustrated here, the program MEGA is used to implement all those steps, thereby eliminating the need to learn
several programs, and to deal with multiple file formats from one step to another (Tamura K, Peterson D, Peterson N,
Stecher G, Nei M, Kumar S. 2011. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolu-
tionary distance, and maximum parsimony methods. Mol Biol Evol. 28:2731–2739). The first step, identification of a set of
homologous sequences and downloading those sequences, is implemented by MEGA’s own browser built on top of the
Google Chrome toolkit. For the second step, alignment of those sequences, MEGA offers two different algorithms:
ClustalW and MUSCLE. For the third step, construction of a phylogenetic tree from the aligned sequences, MEGA
Protocol
today the purposes have expanded to include understanding desired. MEGA5 is, thus, particularly well suited for those
the relationships among the sequences themselves without who are less familiar with estimating phylogenetic trees.
regard to the host species, inferring the functions of genes
that have not been studied experimentally (Hall et al. 2009), Step 1: Acquiring the Sequences
and elucidating mechanisms that lead to microbial outbreaks Ironically, the first step is the most intellectually demanding,
(Hall and Barlow 2006) among many others. Building a phylo- but it often receives the least attention. If not done well, the
genetic tree requires four distinct steps: (Step 1) identify and tree will be invalid or impossible to interpret or both. If done
acquire a set of homologous DNA or protein sequences, (Step wisely, the remaining steps are easy, essentially mechanical,
2) align those sequences, (Step 3) estimate a tree from the operations that will result in a robust meaningful tree.
aligned sequences, and (Step 4) present that tree in such a Often, the investigator is interested in a particular gene or
way as to clearly convey the relevant information to others. protein that has been the subject of investigation and wishes
Typically you would use your favorite web browser to to determine the relationship of that gene or protein to its
identify and download the homologous sequences from a homologs. The word “homologs” is key here. The most basic
national database such as GenBank, then one of several align- assumption of phylogenetic analysis is that all the sequences
ment programs to align the sequences, followed by one of on a tree are homologous, that is, descended from a common
many possible phylogenetic programs to estimate the tree, ancestor. Alignment programs will align sequences, homolo-
and finally, a program to draw the tree for exploration and gous or not. All tree-building programs will make a tree from
publication. Each program would have its own interface and that alignment. However, if the sequences are not actually
its own required file format, forcing you to interconvert files descended from a common ancestor, the tree will be
ß The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: journals.permissions@oup.com.
2
Building Phylogenetic Trees . doi:10.1093/molbev/mst012 MBE
Add to Alignment button (a red cross) near the top of the blast or protein blast link to get to the page identical to
window. that described earlier. Everything is the same as when using
Step 1.31. When you click the Add to Alignment button, MEGA5’s browser except that you cannot click a convenient
MEGA5’s Alignment Explorer window opens and the se- button to add the sequences to the Alignment Editor.
quence is added to that window. After adding a sequence Step 1.52. Open a new file in a text editor. You can use
to the Alignment Explorer use the back arrow in the BLAST MEGA5’s built in text editor by choosing Edit a Text File
window to return to the list of homologous sequences and from the File menu. That editor has several functions for
add another sequence of interest. editing molecular sequences, including reverse complementing
Step 1.4: Protein Sequences and converting to several common formats including Fasta.
The main difference from nucleotide searches is that you may Alternatively, use Notepad for Windows or TextWrangler
see accession number links to several protein sequence files. for Mac (http://www.barebones.com/products/textwrangler/).
These all have the same amino acid sequence, although their Save the file with a meaningful name with the extension.fasta,
underlying coding sequences may differ. Click any one of the for example, myfile.fasta. Do not use Microsoft Word,
links to bring up the protein sequence file, then click the Add Word Pad, TextEdit (Mac), or another word processor!
to Alignment button. Step 1.53. When you have identified the sequence that you
Things that May Go Wrong want to add and clicked the link to take you the page for that
sequence file, adjust the Region Shown and Customize View if
1) You may find that all the hits that are returned from your necessary. Notice the Display Settings link near the top left of
search are from very closely related organisms; that is, if the page. The default setting is GenBank (full). Change that to
3
Hall . doi:10.1093/molbev/mst012 MBE
Align Codons. If your sequence is a DNA coding sequence it Alignment Explorer you can export the unaligned sequences
is very important to choose Align Codons. That will ensure in FASTA format by choosing Export Alignment from the
that the sequences are aligned by codons, a much more real- Data menu, then choosing FASTA format. If you forgot to
istic approach than direct alignment of the DNA sequences keep the unaligned sequences you can select all the sequences
because that avoids introducing gaps into positions that (Control-A), then choose Delete Gaps from the Edit menu
would result in frame shifts in the real sequences. before you export the sequences in FASTA format.
Step 2.2 Step 3: Estimate the Tree
Choosing an alignment method opens a settings window for
that method. For MUSCLE, I recommend that you accept the There are several widely used methods for estimating phylo-
default settings. For ClustalW, the default settings are fine for genetic trees (Neighbor Joining, UPGMA Maximum
DNA, but for proteins, I recommend changing the Multiple Parsimony, Bayesian Inference, and Maximum Likelihood
Alignment Gap Opening penalty to 3 and the Multiple Align- [ML]), but this article will deal with only one: ML.
ment Gap Extension penalty to 1.8. Step 3.1
In MEGA5’s main window choose Open a File/Session from
Step 2.3 the File menu and open the .meg file that you saved in Step 2.
Click the OK button to start the alignment process. Depend-
ing on the number of sequences involved and the method Step 3.2
you chose, alignment may take anywhere from a few seconds ML uses a variety of substitution models to correct for multiple
to a few hours. When the alignment is complete Save the changes at the same site during the evolutionary history of the
4
Building Phylogenetic Trees . doi:10.1093/molbev/mst012 MBE
FIG. 2. An ML tree.
The yellow areas are parameters that you can modify. You To perform the bootstrap test return to the Analysis
5
Hall . doi:10.1093/molbev/mst012 MBE
information that is external to the sequences themselves,
that is, an outgroup. An outgroup is a sequence that is
more distantly related to the remaining (ingroup) sequences
than they are to each other. We cannot infer an outgroup
from the tree itself, so we turn to other information. For the
sequences in figure 2 we know that Pseudomonas aeruginosa
belongs to the order Pseudomonadales, whereas the remain-
ing organisms belong to the order Oceanospirillales, both of
the class Gammaproteobacteria,. Thus, P. aeruginosa is a
legitimate outgroup for the remaining sequences.
Step 4.21. We can root the tree on P. aeruginosa by using
the rooting tool that is found in the sidebar of the Tree
Explorer widow (fig. 3). In either the Rectangular Phylogram
view or the Radiation view, while the rooting tool is selected,
click on the branch leading to P. aeruginosa to root the tree
on that sequence as shown in figure 4.
The rooted tree in figure 4 now correctly implies the dir-
ection of evolution of those sequences. When the tree is
6
Building Phylogenetic Trees . doi:10.1093/molbev/mst012 MBE
7
View publication stats