Statistical Inference Methods For Determining Phylogenetic Trees

Reubyn William Chong
Dr. Sayan Mukherjee
STAT 113: Statistics for Engineers
April 28, 2011

Phylogenetics is a branch of evolutionary biology that deals with the relatedness of species and the formation of evolutionary lineages. Apart from determining how species have diverged over evolutionary time, phylogenies can be used to trace the history of morphological traits, show migration patterns of organisms, and predict the emergence of new disease strains. Building a phylogenetic tree is a complex process involving large amounts of data; it is therefore common for computer programs to use statistical methods to find the tree of best fit.

The construction of phylogenetic trees requires statistical inference on a vast amount of morphological or genomic data. Taxa with similar phenotypic characteristics or homologous DNA sequences are likely to be closely related and to fall within the same monophyletic group. Basic approaches used to build a tree of life include distance methods and parsimony. Distance methods use nucleotide base differences between sequences to build a tree whose branch lengths are proportional to the number of base differences. The method of parsimony holds that the tree that best fits the data is the one that assumes the fewest changes in morphological features or genetic mutations over its evolutionary history. The most parsimonious tree can then be tested through a variety of statistical approaches; among the most common are maximum likelihood methods and Bayesian Markov chain Monte Carlo (BMCMC) inference.
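The distance methods mentioned above start from pairwise counts of base differences. A minimal sketch of that first step follows; the toy alignment, the taxon labels, and the function names are hypothetical and chosen only for illustration, and turning the distances into a tree (for example by a clustering algorithm) is not shown.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

def distance_matrix(alignment):
    """Pairwise p-distances, keyed by alphabetically ordered taxon pairs."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }

# Hypothetical toy alignment (not real sequence data).
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTGT",
}
distances = distance_matrix(alignment)
for pair, d in distances.items():
    print(pair, d)
```

In this toy data the two taxa with the smallest distance would be joined first by a distance-based method, with branch lengths scaled to the number of differences.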
The maximum likelihood approach scores each candidate phylogenetic tree by its likelihood: the probability of the observed data, such as a set of traits for each taxon or single-nucleotide polymorphisms (SNPs) in DNA sequences between taxa, given that tree. In principle, every possible tree for the taxa is considered. The likelihood of each of these trees is calculated by multiplying the likelihood of a node by the product of the likelihoods of its branches.
Equation 1: Likelihood Calculation for a Path Within a Tree
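The node-by-branch multiplication described above can be sketched in code. This is only an illustration under stated assumptions, not the calculation of any particular program: it assumes the Jukes-Cantor substitution model, a single DNA site, equal base frequencies at the root, and a hypothetical nested-list tree representation, none of which are specified in the text.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j along a
    branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node):
    """Return {base: P(tips below node | node has this base)}.

    A tip is a one-letter string such as "A"; an internal node is a list of
    (child, branch_length) pairs. This representation is an assumption.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    children = [(partial_likelihoods(child), t) for child, t in node]
    result = {}
    for parent_base in BASES:
        prob = 1.0
        # Multiply the contributions of the branches leaving this node.
        for child_partial, t in children:
            prob *= sum(
                jc69_prob(parent_base, cb, t) * child_partial[cb] for cb in BASES
            )
        result[parent_base] = prob
    return result

def site_likelihood(root):
    """Likelihood of one site, averaging over equally likely root bases."""
    partial = partial_likelihoods(root)
    return sum(0.25 * partial[b] for b in BASES)

# Hypothetical tree ((A:0.1, A:0.1):0.05, G:0.2) observed at a single site.
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.2)]
print(site_likelihood(tree))
```

For a full alignment, the per-site likelihoods would be multiplied together (or their logarithms summed), and the calculation repeated for each candidate tree.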
The tree with the highest likelihood is deemed the best by the maximum likelihood method. The method has several advantages: it works well with data from distantly related sequences, and it allows a variety of evolutionary models to be used in tree construction. However, the approach has shortcomings in phylogenetics because of the heavy computation it requires; because the number of possible trees grows rapidly with the number of taxa, the method becomes impractical when larger trees must be inferred.

Often in phylogenetic inference, a lack of data makes tree creation difficult. Without a large amount of data, it is difficult to determine the population distribution from the sample. When this is the case, a technique known as bootstrap estimation is used. New samples of the same size are drawn at random, with replacement, from the original pool of data. The variation among these resampled data sets approximates the variation that would be seen across new samples from the population. Bootstrapping is also useful for assessing the validity of a node or branch in a given tree. In this process, the frequency with which the part of the tree of interest appears in the trees built from bootstrap samples measures its support. When a branch occurs in more than 50 percent of bootstrap samples, it is considered likely that the branch belongs to the true phylogenetic tree.
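The bootstrap procedure for assessing a grouping can be sketched as follows. The sketch resamples alignment columns with replacement and counts how often a chosen pair of taxa comes out as the closest pair; the closest-pair rule is a deliberately simple stand-in for full tree construction, and the alignment, taxon labels, and function names are hypothetical.

```python
import random

def hamming(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def closest_pair(alignment):
    """Pair of taxa with the fewest differences (ties go to the first pair
    in alphabetical order)."""
    names = sorted(alignment)
    pairs = [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]
    return min(pairs, key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))

def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Draw n_sites column indices with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {
            name: "".join(seq[c] for c in cols) for name, seq in alignment.items()
        }
        if closest_pair(resampled) == target:
            hits += 1
    return hits / replicates

# Hypothetical toy alignment (not real sequence data).
alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTGT",
}
print(bootstrap_support(alignment, ("human", "chimp")))
```

A support value above 0.5 corresponds to the fifty-percent threshold described above; in practice each replicate would be analyzed with the same tree-building method used for the original data.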
Bayesian inference is a more recently established statistical approach to phylogenies. It can give a better depiction of the best tree because prior information is used. This prior information, expressed as a prior distribution over trees, is combined with the likelihood of the data to produce a posterior distribution from which an expected value can be found. The Bayes formula involves, for each tree, an integration over all possible branch lengths and node placements, followed by a summation over all possible trees.
Equation 2: Bayes' Theorem for the Posterior Distribution of Trees, Where Pr(Tree) Is the Prior
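Written out, Bayes' theorem for trees takes the following standard form. The notation is assumed here for illustration: Data stands for the observed traits or sequences, N for the number of candidate trees, v for the branch lengths of a tree, and f(v) for their prior density.

```latex
\Pr(\mathrm{Tree}_i \mid \mathrm{Data}) =
  \frac{\Pr(\mathrm{Data} \mid \mathrm{Tree}_i)\,\Pr(\mathrm{Tree}_i)}
       {\sum_{j=1}^{N} \Pr(\mathrm{Data} \mid \mathrm{Tree}_j)\,\Pr(\mathrm{Tree}_j)},
\qquad
\Pr(\mathrm{Data} \mid \mathrm{Tree}_i) =
  \int \Pr(\mathrm{Data} \mid \mathrm{Tree}_i, v)\, f(v)\, dv
```

The integral corresponds to the integration over branch lengths for each tree, and the denominator to the summation over all possible trees.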
Since an analytical solution is impossible for large trees, numerical approximations are used instead. The most common of these is the Markov chain Monte Carlo (MCMC) method. Rather than testing every possible tree, the MCMC approach iteratively samples trees according to their fit. First, the current tree is perturbed, for example by randomly moving a branch. If the change improves the tree, it is kept. If the change makes the tree worse, it is usually rejected, although it may still be accepted with a probability that depends on how much worse the new tree is. The process then repeats with further perturbations. Over many iterations, a posterior distribution is built from the frequency with which each tree is visited during the stochastic perturbation process: the more often a tree is sampled, the greater its posterior probability.

The Bayesian method is commonly used for phylogenetic inference. One disadvantage of this method is that the choice of prior distribution can bias the inference. Inference of larger phylogenetic trees can also be problematic, because MCMC chains can converge on a wrong tree during the perturbation process. This problem arises from complicated data, such as multiple base substitutions at a single SNP. It can be addressed by running the perturbation chains for different lengths of time and comparing the results. Otherwise, maximum likelihood methods could fix the convergence problem, provided that the tree sample is not too large. Therefore, for large-tree inference, bootstrap estimation, used to create smaller samples, is performed on the posterior distribution so that maximum likelihood estimation (MLE) tests can be conducted.

In conclusion, phylogenetic inference is a complicated, computationally intensive process that draws on many statistical tests and evolutionary models. Distance methods and parsimony are good strategies for simple trees, but formal statistical computer programs are required for trees built from larger data sets. Maximum likelihood methods are computationally intensive but generally reliable as long as the tree is not too large. Bayesian methods, which work for large data sets, build tree perturbation chains in order to assemble a posterior distribution. Bootstrap estimation is often used in phylogenetics to create a tree sample representative of the population distribution of all possible trees.
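As an illustration of the MCMC tree sampling discussed in this paper, the sketch below runs a Metropolis sampler over three candidate topologies. It is a toy under stated assumptions: the Newick strings and the unnormalized posterior scores (likelihood times prior) are hypothetical values, and the "perturbation" is simply a jump to one of the other two trees rather than a real branch rearrangement. Under the Metropolis rule used here, a proposal that scores higher is always accepted, and one that scores lower is accepted with probability equal to the ratio of the scores.

```python
import random
from collections import Counter

def metropolis_tree_sampler(unnormalized_posterior, steps=50000, seed=0):
    """Estimate posterior probabilities of trees from visit frequencies."""
    rng = random.Random(seed)
    trees = list(unnormalized_posterior)
    current = trees[0]
    visits = Counter()
    for _ in range(steps):
        # Propose a different tree uniformly at random (a symmetric proposal).
        proposal = rng.choice([t for t in trees if t != current])
        ratio = unnormalized_posterior[proposal] / unnormalized_posterior[current]
        # Always accept improvements; accept worse trees with probability ratio.
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {t: visits[t] / steps for t in trees}

# Hypothetical scores, proportional to likelihood times prior for each topology.
scores = {
    "((A,B),(C,D));": 6.0,
    "((A,C),(B,D));": 3.0,
    "((A,D),(B,C));": 1.0,
}
posterior = metropolis_tree_sampler(scores)
for tree, freq in posterior.items():
    print(tree, round(freq, 3))
```

Because the scores sum to 10, the exact posterior probabilities in this toy are 0.6, 0.3, and 0.1, and the visit frequencies should approach those values as the number of steps grows. Real programs cannot enumerate the scores in advance, which is why they rely on the visit frequencies alone.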
References

S. Freeman, J.C. Herron. Evolutionary Analysis. 4th edition. Pearson Prentice Hall. 2007.
J.P. Huelsenbeck, F. Ronquist, R. Nielsen, J.P. Bollback. Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology. Science. 14 December 2001. http://www.sciencemag.org/content/294/5550/2310.full
Wikipedia, The Free Encyclopedia. Maximum Parsimony (phylogenetics). 12 April 2011. 1 May 2011. http://en.wikipedia.org/wiki/Maximum_parsimony_(phylogenetics)
