

Quantitative Cladistics and Use of TNT


All Rights Reserved. Pablo A. Goloboff, Instituto Superior de Entomología, CONICET, Facultad de Ciencias Naturales e Instituto Miguel Lillo, Miguel Lillo 205, 4000 S.M. de Tucumán. The data sets for these exercises are distributed electronically, as part of a 5-day course in Cladistics. Do not distribute this handout!

Introduction
Throughout, the general format for the assignments is: file or folder names are indicated as filename; TNT commands are indicated as command (help with help command;); menu choices (available only for Windows) are indicated as Choice. In every case, unless otherwise specified, start by reading example.tnt (File/OpenInputFile) and calculating most parsimonious trees (note: most parsimonious trees can be calculated with Analyze/TraditionalSearch, or with the command mult; using default settings in either case). Save all files to a mon_mor folder, to keep the machine clean.

1 Create an output file called trees.out (File/Output/OpenOutputFile) and write tree diagrams for trees 0, 4, and 6 (Trees/DisplaySave, and select the trees you want to save). In the same file, include a table (default format) with the lengths for all trees (Optimize/TreeLengths), and a table (optional format, set with Format/OptionalTableFormat) for the score of characters 10-20 in trees 3-4.

2 Create a tree-file in compact notation (File/TreeSaveFile/OpenCompactMode), called example.ctf. Save trees to that file (File/TreeSaveFile/SaveTreesToFile), and close it (File/TreeSaveFile/CloseTreeFile). Create another tree-file, in parenthetical notation (File/TreeSaveFile/OpenParenthetical), called example.tre. Save trees, using taxon numbers (set with Format/UseTaxonNames), and close the file. Then create a third file in parenthetical notation, called taxnames.tre, and save the trees, but using taxon names. Exit the program, enter again, and re-read the trees from each of the files; confirm that the trees are identical (this can be done with Trees/TreeBuffer/Filter, with defaults, which simply discards duplicate trees, or Trees/TreeBuffer/CompareTrees, which provides a list of non-unique trees). What is the difference in size between the files example.ctf and example.tre? When is it advisable to save the trees using taxon names, instead of numbers?

3 (Windows only) Create a metafile, example.emf, to include a drawing of tree 3. Create a PowerPoint file, and copy the image from example.emf into one of the slides. There are two ways to do this (see the command sketch after exercise 4). The first is manual: make sure "tree-preview" is ON (with Format/PreviewTrees), then display the tree-diagram (as you did for exercise 1) and, when in the previewing screen, press "M" (for "metafile"). The second way is automatic: open the metafile first, with File/Output/OpenMetafile (or with log & example.emf;); this automatically sets the preview to OFF, so that you will not need to be there to press any keys for execution to continue after saving the tree-diagram to the metafile. Then display the tree-diagram (as you did for exercise 1); this automatically writes the tree diagram to the metafile. Then close the metafile (File/Output/CloseMetafile, or log /&;).

4 Read the data set from contin.tnt. Create a random tree. Then edit the tree (manually, in tree-view mode, which you get by clicking on the button with the eye and the tree, in Windows, or with the edit command, in other versions). Make the tree ( B ( C ( D E ) ( F G H ) ) ); save the tree diagram to an output file, contin.out. Then edit the tree again, to now include J as sister group of G, and K as sister group of H, and save the tree diagram to the output file.
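For the automatic route in exercise 3, the whole sequence can also be typed at the command line. This is only a sketch combining commands already given above (log & / log /& for the metafile, and tplot, used later in the solutions, to write a tree diagram):

log & example.emf ;    (open the metafile; tree-previewing is turned off automatically)
tplot 3 ;              (write the diagram of tree 3, which now goes to the metafile)
log /& ;               (close the metafile)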

5 Create a file with instructions, instructs, for TNT to do the following task(s):
a) open a log file, automatic.out
b) read the data from example.tnt
c) calculate most parsimonious tree(s)
d) save consensus of all trees
e) (Windows only) open a metafile, automatic.emf, and save the consensus to it
f) calculate length of all trees found
g) exit the program
Create a batch file, automatic.bat (under Windows) or a script automatic (under Linux/Mac), which calls TNT and makes it read (=execute) the instructions in file instructs. The commands to use here are: log, procedure, mult, tplot, length, quit. You can get help on the syntax for command xxx by typing help xxx at the command line.

6 (Windows only) Create a file with batch-menu instructions, instructs.bmn, for TNT to do the same tasks as in point 5. Create a batch-file, autobatch.bat, which calls TNT and makes it read (and execute) all the instructions in instructs.bmn (procedure).

7 Make characters 22, 26, and 92 non-additive; make characters 42 and 34 additive (Data/CharacterSettings; the same can be done with the ccode command). Calculate most parsimonious tree(s). What's the resulting length? (Note: should be 383). Without exiting the program or re-reading the data, create a character-state tree for character 101:
        0
       /
  2---3---1
       \
        4

Re-calculate most parsimonious trees; what's the resulting length? (note: should be 389).

8 Create a file, myhelp.txt, which contains a list of all the TNT commands, and a brief description of the options for all commands.

9 With a text editor, fuse the data sets in ..\dsets\part_a.tnt and ..\dsets\part_b.tnt (imaginary molecular and morphological data sets). Save the single data set to a file mixed.tnt (a sketch of the merged file layout is given after exercise 10). Make sure the ccodes are properly adapted, using the @ option. If so, the minimum length should be 699 (although superficial searches may produce trees of 700 steps, or even 701).

10 On Friday, we will see scripts; scripts can be used to produce special color diagrams. An example is in labeled.tnt, which contains a list of names and a tree (the shortest tree found by Goloboff et al., 2009, for mammals). The taxon names in the data set contain the full hierarchy of mammalian classification, which can be processed with the scripts dohi.run and colorgroups.run (copied to the ../monday folder). Reading that data set, and typing dohi taxon_A; will display the group in the tree closest to the taxon specified (in Windows, tree-previewing must be turned off for this). In Windows, colorgroups taxon_A taxon_B; will display a tree-diagram, where the branches of taxon A and B are shown with different colors (up to 10 groups can be shown; tree-previewing must be turned ON). This can be used to facilitate checking the results of an analysis.
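The merged file for exercise 9 can follow TNT's usual layout for data given in blocks, with each block preceded by a format marker (&[dna], &[num]). Everything below is a made-up miniature (taxon names, sequences, and character counts are placeholders, not the contents of part_a.tnt or part_b.tnt), meant only to show the shape of such a file:

xread
12 3
&[dna]
taxon_A ACGTACGT
taxon_B ACGTACGC
taxon_C ACCTACGC
&[num]
taxon_A 0101
taxon_B 0111
taxon_C 0011
;
proc/ ;

The ccode statements taken from part_b.tnt presumably need their character numbers shifted (the morphological characters no longer start at 0 in the merged file); the @ option of ccode, mentioned in the exercise, lets character numbers be given relative to a block rather than to the whole matrix.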

Optimization
1 Read the data set in example.tnt, and calculate most parsimonious tree(s). Map characters onto trees, using: (a) (Windows only) color codes; (b) numbers to indicate states; (c) state names.

2 (Windows only) Then, create a metafile, colormap.emf, which includes a color mapping of the character named male_spur. Use thick branches to make sure the colors are clearly visible.

3 On tree number 0, count the minimum and maximum possible numbers of transformations (change, Optimize/SpecificChanges) for character male_spur. Count the number of losses and the number of gains (these should be: losses, 7-8; gains, 2-5).

4 For the following data set:

A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9

(a non-additive character), create a random tree, and count the number of possible reconstructions (recons, Optimize/Characters/Reconstructions). How many are there? (note: there should be over 4,000!). A sketch of this matrix in TNT format is given after exercise 6 below.

5 For the data set example.tnt, find most parsimonious trees, and then calculate the strict consensus (with Trees/Consensus, plotting node numbers; whether node numbers are plotted or not is controlled with naked, or with Format/ShowNodeNumbers). Calculate the synapomorphies common to all the trees. What are the common synapomorphies for the node common to the taxa named L_ (=Lycinus) and D_ (=Diplothelopsis)? (probably this node is numbered 104 in the consensus, depending on how you did your search). Are characters 22, 45, 46, 64, 85, and 102 synapomorphies of that group in any of the shortest trees?

6 For the same situation above, produce the common mapping of character 22 (maxillary_cuspules) onto the most parsimonious trees (Optimize/Characters, or the map[ option). If the consensus is optimized as such, then the character changes without ambiguity from 1 to 0 in the node common to Lycinus and Diplothelopsis. Confirm this, and confirm that there is a most parsimonious tree where that change does not occur (or does not occur unambiguously).
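For exercise 4, the one-character matrix can be typed into a small TNT file like the sketch below (xread takes the number of characters and then the number of taxa; rand and recons are the commands cited in the exercise, but check help recons for their exact arguments; the file name is up to you):

xread
1 10
A 0
B 1
C 2
D 3
E 4
F 5
G 6
H 7
I 8
J 9
;
rand 1 ;     (create one random tree)
recons ;     (count and display the possible reconstructions)
proc/ ;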

7 For the data in example.tnt, create 10 random trees (random). Sort them, from best to worst (sort). Calculate the lengths. Then, retain the shortest of the random trees; what is its length, compared to the length of the most parsimonious trees?

8 Read the data from example.tnt, and then read the trees from ..\dsets\mixture.ctf (shortread, File/ReadCompactTreeFile). Calculate tree-lengths (length, Optimize/TreeLength); some trees have length 382 (minimum), others 384 and 385, and some trees are very long (they are random trees). Create tree-groups (Trees/TreeGroups) for each of the length classes, so that the groups are named:
1. "shortest": including all trees of length 382
2. "medium": including all trees of length 384
3. "longer": including all trees of length 385
4. "random": including all trees of length greater than 385
5. "notsobad": including the trees from groups 1 and 2.

Then, use the groups created to output tree lengths (which, by the way, provides confirmation that the groups were properly created). If you don't have problems defining the tree-groups with the menu interface, repeat tree-group definition in a file, using commands (tgroup command). 9 Read data from example.tnt, and then read the trees from ..\dsets\mixture.ctf (shortread, File/ReadCompactTreeFile). Condense the trees (condense, Trees/TreeBuffer/CondenseTrees, with default settings; settings are controlled with collapse, Settings/CollapsingRules). Produce a table of the number of nodes of all the trees (tnode, Trees/Describe/NumberOfNodes).

Tree Searches
1 Read the data from example.tnt. Deactivate all taxa, except the first 20 (i.e. taxa 0-19). Calculate an exact solution. Compare the results with 100 random addition sequences, saving up to 10 trees per replicate. Is the heuristic search likely to have found the actual minimum length for the first 20 taxa? Why? Add taxa one by one, and compare the times required for exact solutions with 21, 22, 23, etc. taxa, until an exact solution cannot be achieved in about 5-10 minutes. Then, run 100 random addition sequences with up to 10 trees per replicate; is it likely that there are shorter trees?

2 Read the data set in tbrdemo.tnt, which automatically calls the script dotbr.run (in the dsets directory). This must be run with the character-mode version, and is a graphical demonstration of how a tree-search proceeds in practice.

3 Read the data from fam.tnt. What is the length of the shortest trees? How many distinct trees are there? How many TBR islands are there? Find the best trees where Ummidia+Calathotar+Heteromigas+Actinopus+Plesiolena+Idiops+Neocteniza+Misbolas do not form a monophyletic group, the best trees where Stenoteromm+Acanthogona do not form a monophyletic group, and the best trees where MECICOBOT+ATYPIDA form a monophyletic group. What are the lengths in every case? (should be: 228, 229, 235). Then find a tree where all those constraints are satisfied at the same time; why does the minimum possible length differ? (should be 238).

4 Read the data from example.tnt. Set the collapsing rule to "max. length = 0" (rule 3). What would be the best strategy for finding all the equally most parsimonious trees under that setting?

5 Read the data from tricky_1.tnt. With collapsing as set in the file (to be seen in a future class), there are 864 distinct trees. In general, it might be expected that, as more trees are saved in each of several rand-add-seqs, it becomes more likely to find all the equally parsimonious trees than if one saves a few trees per replication. Test this idea by running two different analyses: first, run 20 replications of a random addition sequence, saving up to 216 trees/replicate. Then, run 20 replications, saving up to 430 trees/replicate. If you want to make sure of the differences, run several times, changing the random seed, or using time as the random seed (rseed 0;). Which of the two alternatives finds all the trees? Is this in agreement with the expected results? Why? Compare this with the number of trees found if, after completing the rand-add-seqs, global TBR is performed starting from the trees found.

6 Read the data from tricky_2.tnt. How can you explain that the exact solution (finding all the trees of minimum length: 10,395 trees of 306 steps) of this matrix can be done much faster (about 50x) than a heuristic solution (from a single starting point, saving all possible trees)?

7 Read the data in tricky_2.tnt, set the collapsing to "min. length = 0" (=rule 1), and then compare the running times for a single starting point for TBR saving up to 1000 trees, with the running times of TBR saving up to 11,000 trees. The first finds (and swaps) 1000 trees in X secs. The second finds (and swaps) 10,395, which is about 10 times more. The second, however, doesn't take 10 times more, but instead takes several hundred times more. Why is that?

8 Read the data in zilla.tnt. Compare the results of searching with three different strategies:

a) multiple random addition sequences, saving up to 10 trees per replication
b) a single random addition sequence, saving up to 10,000 trees
c) as in (b), but setting collapsing to "none".
In all cases, set the timeout to 3 minutes (with timeout 3:0, or with Analyze/Timeout), so that all searches use the same amount of time. Change the random seed between searches, and repeat several times. We will try to calculate grand totals for the whole lab. A command sketch for strategies (a) and (b) is given after exercise 9 below.

9 Just for fun: read the data set from hel.tnt. That is a relatively difficult data set, with 854 taxa. The minimum length is 23005. Using only traditional search strategies, try to find trees as short as you can. Produce a log, where the status is saved every 30 secs to a file called heltrad.out (controlled with report, or Settings/ReportLevels). This will be compared to the results one can obtain using new strategies, in future classes.
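A rough command-line equivalent of strategies (a) and (b) of exercise 8, built only from commands used elsewhere in this handout (keep, hold, rseed, timeout, mult); the 30 replications for strategy (a) are just a placeholder, since the exercise does not fix that number, and strategy (c) additionally requires setting collapsing to "none" under Settings/CollapsingRules:

rseed 0 ;                            (random seed = time; change it between runs)
timeout 3:0 ;                        (so that all searches use the same amount of time)
hold 10000 ;
keep 0 ; mult = rep 30 hold 10 ;     (strategy a: multiple RAS, up to 10 trees per replication)
keep 0 ; mult = rep 1 hold 10000 ;   (strategy b: a single RAS, saving up to 10,000 trees)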

Ambiguity, Consensus, Tree-Collapsing, Comparing Trees


1 Read the data from example.tnt, and find all equally parsimonious trees under the default collapsing rule, "min. length = 0" (=rule 1). How many are there? Calculate the strict consensus tree; how many nodes does it have? (nelsen, tnodes, or Trees/Consensus and Trees/Describe/NumberOfNodes). Now set the collapsing rule to "max. length = 0" (=rule 3). Try to find all equally parsimonious trees; how easy is it? With all the trees you could find, calculate the strict consensus; how many nodes does it have? (a command sketch is given after exercise 4 below).

2 Read the data from example.tnt (make sure settings are the default ones before reading the data set). Do a few random addition sequences (changing random seeds, if necessary) saving up to 10 trees/replicate. This should find 10 trees of 382 steps (we know from the previous exercise that there are 72 such trees). Calculate the strict consensus, and count its number of nodes. Now, set collapsing to "max. length = 0" (=rule 3) and calculate the consensus of the 10 most parsimonious trees (make sure you don't include the previous consensus among the input trees!); count the number of consensus nodes. Lastly, turn temporary collapsing off, calculate the consensus of the most parsimonious trees, and count the numbers of nodes. What are the differences in numbers of nodes? Which result should be reported? To conclude, select any of the most parsimonious trees, and calculate the consensus from that tree alone (not available from the menus; you can use Trees/TreeBuffer/Condense) with temporary collapsing using TBR. Count the number of nodes.

3 Read the data from liebherr.tnt. That is a medium-sized data set, quite difficult for its size. Read the file liebherr.ctf, which includes a number of most parsimonious trees for that data set (just a sample, from 10 independent hits, some of which found several trees). Turn temporary collapsing off. Calculate the consensus; is it possible to improve the resolution of the consensus by ignoring the position of some taxa? Concentrate on nodes 173 and 196 (the others have no hope of resolution) (prunnelsen, Trees/Comparisons/PrunnedTrees). Calculate the consensus excluding the taxa, but "show the location of the pruned taxa" on the reduced consensus. An alternative strategy to find taxa that decrease resolution in the tree is by using TBR-tracking. This is not implemented in the menus, but the syntax is simple enough. If you have in memory only the 24 trees in liebherr.ctf, then you can type, at the command line:

chkmoves [/5 > 0 ;
nelsen // {0} ;

(the chkmoves command does TBR-tracking, placing in taxon-group number 0 the taxa that can be moved 5 or more nodes away during TBR; the nelsen command calculates the consensus pruning the taxa in group 0; the double slash indicates that the placement of the pruned taxa on the main tree must be shown). This produces results rather similar to those of the previous option. Looking at the consensus, how many potential resolutions are there for nodes 173 and 196? How many different resolutions are actually found for those nodes? (resols, Trees/Comparisons/ShowResolutions).

4 Read the data from liebherr.tnt, and read the trees from liebherr.ctf. Make sure temporary collapsing is ON, and the collapsing rule is set to "min. length = 0" (=rule 1). Find the number of groups present in tree 0 but not in tree 1; save the tree diagram, and the count of the number of nodes, to a file called anticons.out. Now compare trees 0 and 2. Are the counts the same? What may be causing the difference between the two comparisons?
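A minimal command sketch for exercise 1; this assumes that the numeric rule can be passed directly to the collapse command (the command referred to under Settings/CollapsingRules), which should be confirmed with help collapse:

collapse 1 ;     (rule 1, "min. length = 0")
mult ;
nelsen * ;       (strict consensus, saved to the tree buffer)
tnodes ;         (count its nodes)
collapse 3 ;     (rule 3, "max. length = 0"; now repeat the search and the consensus)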

5 Read the data from fam.tnt, and calculate most parsimonious trees. Condense them. Calculate the agreement subtree. How many taxa are included? (note: PAUP* often provides a better agreement subtree than the heuristic in TNT, but for rooted trees it sometimes, as in this case, reports some taxa that should actually be absent from the agreement subtree).

6 Read the data from badagree.tnt. The trees have 49 taxa. Calculate the agreement subtree. How many taxa are there in the agreement subtree? Would this normally be taken to indicate a high or a low similarity between the trees? Is that interpretation correct? Calculate the strict consensus, and calculate the SPR distances between the two trees (sprdiff, Trees/Comparisons/SPRDistances). Which is the better measure of similarity, in this case?

7 A difficult one: read the data from funny.tnt, with 18 taxa. View the matrix. As it is easy to see, there are several distinct trees, resulting from alternative placements of taxon X. Compare the results of:
1) Find all trees (using an exact solution). Set collapsing to "min. length = 0" (=rule 1). Condense the trees. Then, since we know that the floating taxon is X, we can calculate the strict consensus without X (manually excluding that taxon from the consensus calculation, from the menus, or using nelsen /x; make sure you turn temporary collapsing off).
2) Find all trees. Set collapsing to "min. length = 0" (=rule 1), turn temporary collapsing ON, and calculate the strict consensus without X (manually excluding that taxon from the consensus calculation, from the menus, or using nelsen /x).
3) Find all trees. Set collapsing to "max. length = 0" (=rule 3). Condense the trees. As in point (1), calculate the strict consensus without X.
What is the consensus for options (1), (2), and (3)? Why the difference? What does this tell you about the way in which TNT temporarily collapses the trees during consensus calculation?

8 Read (again) the data set from funny.tnt; find all equally parsimonious trees. Calculate the majority rule consensus tree. What does it tell you about the position of X? Is that a meaningful conclusion?

9 Read example.tnt, and then read the 4 trees from the file incomple.tre. View trees 1-3; they have different taxon subsets. Try to calculate the supertree by hand. Then calculate the semi-strict supertree, and compare results. Lastly, calculate the MRP tree (mrp, or Trees/TreeBuffer/CreateMRP). Repeat, but now using trees 0-3 (which have conflict, and should be much harder to do by hand).

10 Read example.tnt, and then read the 2 trees from the file sprdiffs.tre. This contains two trees, the strict consensus of which is poorly resolved. Calculate the SPR moves necessary to convert the first tree into the second (sprdiff, or Trees/Comparisons/SPRdifferences, stratifying calculations in 1 level). What is the number of SPR moves? How similar are the two trees, according to this criterion?


Character weighting
1 Read the data from example.tnt. Set implied weighting ON, with concavity 5. Find all optimal trees; how many are there? What is their score? (should be 54 trees with score 25.37908). Save to a file, sensitive.ctf, the consensus of the trees optimal under concavities 5-8 (see the command sketch after exercise 6). What are the differences? If there are differences, what is the implication, in terms of credibility for the different groups?

2 Turn implied weighting ON (so that you can subsequently define user weighting functions), and read the data set in the file wtfuncs.tnt. It contains two blocks of data, "random" and "perfect" (with obvious, literal meanings). Turn implied weighting OFF, and find minimum length trees. Calculate the length for the block of good data only (L1). Note the block of random data can be easily deactivated with ccode ] @random . ; from the command line. Alternatively, you can use block = structured, or Data/CharacterGroups/Blocks/ActiveBlocks. Find the best possible length for the structured data alone. Then activate all blocks, and turn implied weighting ON. Unsurprisingly, when you search, the searches quickly converge to the topology determined by the structured data. Now, specify a user weighting function where the weight increases with the homoplasy; use weights 1.0, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.40, 1.50, 1.60, 1.65, 1.70, 1.80, 1.90, and 2.00 (for 0 to 14 extra steps, the maximum possible on this data set). Find the trees of minimum score (should be a single tree of 549.75). Calculate the length for the block of good data; how does this length compare to the length L1? Is this expected?

3 With a user-defined weighting function, find the compatibility trees for the data set in example.tnt (primary clique only). Calculate the strict consensus and count the number of supported nodes; this explains why clique analysis was abandoned as a method to estimate phylogenies. Try to also calculate secondary and tertiary cliques; how would you define the weighting function for that?

4 (very tricky!) Read the matrix in autowts.tnt. Run under equal weights, and under implied weights (with k=6). Then, turn auto-weighted optimization ON, with concavity constant k=6. With auto-weighted optimization, TNT will weight transformations 0 to 1 and 1 to 0 separately. Run again. Note the difference in the relationships between M_HUARIA, LUTEA, and BISCA. Why that difference? (a little help: it has to do with characters 0, 1, and 2; characters 0 and 1 have identical distributions. Mapping these characters onto the optimal trees for each criterion may help understand the difference. Finding slightly suboptimal trees may also help.)

5 Read the data from example.tnt. Use the script supwt.run to do iterative reweighting of the data (as described in Farris, 2002). First, run support-weighting (the default option); run 15 replications, making no more than 6 rounds per replication. Then, run successive weighting, weighting against homoplasy. Compare the results. Is there a difference in the time used for each of the methods? What do you think this may suggest?

6 Read the data from example.tnt. Use the script rewt.run (it comes with the program) to perform successive weighting.
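A minimal command sketch for exercise 1, assuming that piwe (TNT's implied-weighting command) takes the concavity value as shown, and that tsave/save can be used to write trees to the file sensitive.ctf; both assumptions should be checked with help piwe and help tsave:

piwe = 5 ;     (implied weighting ON, concavity k = 5)
mult ;         (find the optimal trees under that concavity)
nelsen * ;     (strict consensus, appended to the tree buffer)

Repeating these three lines with piwe = 6, 7, and 8, and then saving the four consensus trees (File/TreeSaveFile from the menus, or tsave sensitive.ctf ; save ; tsave/ ; from the command line) produces the sensitive.ctf file asked for.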


Support measures. Bremer support. Resampling. Effect of type of searches and collapsing methods.
1 Read the data from example.tnt. Calculate most parsimonious trees (72 trees, 382 steps). Then, produce two estimations of Bremer supports:
a) search for trees up to 15 steps longer than the best (save up to 10,000 trees; make sure you set bb: fillonly, or click on stop when maxtrees hit).
b) search for trees in steps: 1x1000, 2x2500, 3x5000, 4x7500, 6x10,000.
Produce the estimation of supports in each case; concatenate the tree tags in one target tree, so that the values can be easily compared. Which values should be preferred?

2 For the data in example.tnt, plot on a tree the absolute Bremer support, the relative Bremer supports (considering only trees within absolute supports), and the values of symmetric resampling. Are the measures generally correlated?

3 For the data in fam.tnt, a researcher had proposed the phylogeny in the tree-file previous.ctf. Evaluate whether that researcher had proposed reasonable hypotheses using:
1. bootstrapping and absolute frequencies
2. symmetric resampling and GC values
The biggest difference is in the placement of PARATROPIDI. What can be concluded if the position of that taxon is disregarded?

4 Read the data in funny.tnt. Calculate values of symmetric resampling:
a) collapsing with TBR, searching with one rand-add-seq (saving a single tree) for each resampled matrix (=pseudoreplicate);
b) collapsing with "min. length = 0" (=rule 1), searching as in (a);
c) as in (b), but saving up to 3, 15, 20, and 100 trees per pseudoreplicate.
What is causing the differences? Which are the correct values? If PAUP* is available on the lab machines, calculate bootstrap values (boot search=heuristic;).

5 For the data in liebherr.tnt, estimate symmetric resampling values using different search routines, for the groups in the tree liebcon.ctf. Since we want special search routines, we will have to use the command line. Make sure first to set:

collapse tbr ; <enter>       (so that trees are collapsed strictly)
sect: slack 25 ; <enter>     (so that re-packing for sectorial searches can be done)
hold 5000 ; <enter>          (so that enough memory space is available)
naked ] ; <enter>            (so that trees are narrower)
ttag = ; <enter>             (so that labels for support are on the same tree)

The command to resample will be:

resample replications 100 sym gc from 0 [ search routine ] ;

The search routines to try will be:

mu1=ho1;
mu3=ho1 keep;
mu10=ho1 keep; sort; keep 5;
xmu = rep 3 nofuse keep drift 5;
xmu = hits 5 rep 4 fuse 3 drift 10;

Compare the results of using the different search routines. What is the implication of each search routine? (A fully assembled example is given after exercise 8 below.)

6 Estimate group supports (using resampling) for the data in zilla.tnt. How would you proceed to do that?

7 Read the data in example.tnt, find most parsimonious trees, and run the macro bremer.run. Compare the values with those obtained in exercise (1).

8 A common measure is the "partitioned Bremer support" or PBS, which calculates values of "support" for each data partition, in such a way that they sum up to the total Bremer support (on the combined data set). The data set badps.tnt is an example where the one partition that supports a group (group DE) has a negative PBS, and the one partition that contradicts it has a positive PBS. To run it, read the data set, and then type pbs at the command line (or open the file pbs.run).
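For example, putting the setup of exercise 5 together with one of the listed search routines gives the following sequence (all of it taken verbatim from the exercise; only the combination into a single run is new):

collapse tbr ;
sect: slack 25 ;
hold 5000 ;
naked ] ;
ttag = ;
resample replications 100 sym gc from 0 [ xmu = rep 3 nofuse keep drift 5 ] ;

The resulting GC values end up as tags on the target tree (tree 0), which is why ttag = ; is set beforehand.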


Searches for large and difficult data sets. Quick consensus estimation.
1 Read the data set from zilla.tnt. Minimum length is 16218. Set the random seed to time (rseed 0, Settings/RandomSeed), and then run several times each of the following strategies:

a) keep 0 ; mult = rep 30 hold 1 keep ;
b) keep 0 ; mult = rep 30 hold 1 keep ; tfuse ;
c) keep 0 ; xmu = rep 6 keep nofuse nodrif rss xss giveup 16218 ;
d) keep 0 ; xmu = rep 6 keep nofuse nodrif rss xss giveup 16218 ; tfuse ;

Compare the average resulting lengths for each of the routines, given that they use roughly similar times. Observe the dramatic difference between (a) and (b).

2 For the data set in hel.tnt, try (once) each of the strategies in exercise 1 (replacing 16218 by 23005, the minimum length for this data set). Probably you will be far away from the minimum length. Try also:

report +/1; keep 0; mu1=ho1; drift = iter 30 nums 250 fitd 3 rfitd N ;

where N is 0.05, 0.15, and 1.0. Here, the two most important changes from the drifting defaults are the acceptable fit difference (3 instead of 1), and up to 250 substitutions instead of about 100. Observe how (when N increases) the perturbation phase tends to produce longer trees.

3 For hel.tnt, set:

report +/1/1;
sec : xss 15-6+6-3 godrift 300 fuse 2 ;

This sets the exclusive sectorial searches to use divisions of the tree between 15 and 6 (i.e. from 854/15=56, to 854/6=142 nodes), and to analyze all of them with multiple RAS, fusing the results two times (the godrift 300 means that only sectors larger than 300 will be analyzed with tree-drifting, which in this case is none). Once you have properly set the parameters for the sectorial search, run (with random seed = time):

xmu = hits 5 target 23005 rss xss drift 5 fuse 3 ;

These are probably the parameters that work best with this data set (perhaps they could be improved by adjusting the options for tree-drifting as in the previous exercise, but they work pretty well anyway).

4 For the data set in soltis.tnt, minimum length is 44163. That is an easy, well structured data set, with only 567 taxa. Minimum length is easily achieved with almost default parameters (only xss added):

rseed 0 ; report +/1/1 ; xmu = hits 5 xss target 44163 ;

5 Run PaupRat on hel.tnt. If you're not familiar with PaupRat, you can simply use the batch file provided in ..\dsets, pooprat.bat. To run this, you have to open a DOS shell, go to the ..\dsets directory, and then type pooprat N <enter>, where N is the number of characters in your nexus file. Then, you enter PAUP*, read your data set, and read the file ratchet.nex, with the instructions created by PaupRat. This will do searches, and for each, it will report status every 15 secs. Compare the results with those in exercise 3.

6 If you want an additional test, to show differences for larger data sets, run the data set from randrex.tnt with TNT; set the random seed to time (=0), do a random addition sequence saving a single tree (mult = repl 1 hold 1;), and let the ratchet run for some amount of time (say, 5 minutes), using default parameters (but set ratchet = iter 100 or ratchet = iter 200). Record the best length found, and the number of rearrangements examined. Then export it as Nexus (export filename, or Data/Export), and try to see how long it takes (or would take) PAUP* to find the same length, or to perform the same number of rearrangements, as TNT.

7 For randrex.tnt, estimate the consensus, using qnelsen (with [mult = rep 1 hold 1;]) or Analyze/EstimateConsensus. Try using the resulting tree as constraints for an xmult search:

force = &0 ; constraint= ; xmult = repl 6 xss drif 6 ;

Under constraints, the search proceeds significantly faster (note the number of rearrangements/second).

8 The file zillacon.ctf contains the true consensus of the minimum length trees for zilla (in zilla.tnt).
1. First, estimate the consensus, as in the previous exercise. Use different levels of estimation (for both precision and accuracy). Compare the results of the estimation with those in zillacon.ctf, counting both the number of true nodes recovered and the number of spurious nodes found.
2. Then, do several searches, finding minimum length only once. Compare again with the results in zillacon.ctf. Count the number of mistaken groups in every case, using different methods for condensing the tree.

9 Calculate a stable consensus for zilla.tnt; stabilize 2 times, with factor 75 (i.e. check stability at hits N + 0.75 N, where N = number of previous hits, starting at 5 hits). Compare results with zillacon.ctf.


Introduction (Solutions)
1 - Windows:
a) select File/OpenInputFile and read example.tnt in.
b) select Analyze/TraditionalSearch and accept all defaults.
c) select File/Output/OpenOutputFile and create trees.out.
d) select Trees/DisplaySave, and then select the trees. You can either (1) double-click on 0, 4 and 6, or (2) select 0, 4, and 6 (holding the ctrl key down and clicking on each of them) and then click on the arrow pointing to the right.
e) select Format and make sure that Optional table formats is not checked.
f) select Optimize/TreeLengths and then all trees (the default). This is equivalent to the treeruler tool.
g) select Format and check Optional table formats.
h) select Optimize/CharacterScores, and then select trees (3, 4) and select chars. (10-20).

Others:

proc example.tnt ;
mult ;
log trees.out ;
tplot 0 4 6 ;
table - ;
length ;
table = ;
cscores 3 4 / 10.20 ;

2 The difference in size between the three tree files should be:
802 bytes for the compact file,
4187 bytes for the parenthetical file with taxon numbers,
10187 bytes for the parenthetical file with taxon names.
So the compact file is about a fifth of the size of the parenthetical file with numbers, using much less disk space.

Taxon names should be used when the sequence of the taxa in the matrix might (for any reason) be changed in the future, which would invalidate any tree file based on taxon numbers (as are both the compact format and the parenthetical format with numbers).

3, 4 Should present no problems.

5 Contents of file instructs:

log automatic.out ;
p example.tnt ;
mult ;
nelsen ;
log & automatic.emf ;    /* in Linux/Mac, omit this line and the next two */
nelsen ;
log /& ;
length ;
quit ;

Contents of automatic.bat:

tnt instructs ;

Contents of automatic:

tnt p instructs ,    /* if this is not in the current directory, give the path! */

6 Windows: First select Settings/Batch/SetMenusToBatch, and then:
h) open a log file, automatic.out: File/Output/OpenOutputFile
i) read the data from example.tnt: File/OpenInputFile
j) calculate most parsimonious tree(s): Analyze/TraditionalSearch
k) save consensus of all trees: Trees/Consensus
l) (Windows only) open a metafile, automatic.emf, and save the consensus to it: File/Output/OpenMetafile, Trees/Consensus, File/Output/CloseMetafile
m) calculate length of all trees found: Optimize/TreeLength
n) exit the program: type quit [enter] at the command line
Then select Settings/Batch/SetMenusToNormal and Settings/Batch/SaveActionsToFile. Quit the program.

Now create (using Notepad, WordPad, or any other text editor) a file, autobatch.bat, with the following contents:

tnt ; p : instructs.bmn ;

7 Non-windows: create a file A with the following contents (you can cut-and-paste):
ccode - 22 26 92 + 42 34 ;
proc/;

create a file B with the following contents:

cstree 101 = 0 / 2-3-1 \ 4 ;
proc/;

Then, reading file A will do the first part of the exercise, and reading file B the second. In Windows versions, you can paste the tree itself at the text-dialog which appears under Character State tree in Data/CharacterSettings. Make sure you select the characters to which you want to apply the additivities or character-state tree, by clicking on the CHARS. select button.

8 type, at the command line:

log myhelp.txt ;
help ;
help* ;
log/ ;

9, 10 should present no problems


Optimization (Solutions)
1 - Windows:
a) make sure that Map characters in color and Preview trees (under Format) are checked, then select Optimize/Characters/CharacterMapping.
b) make sure that Map characters in color (under Format) is not checked, then select Optimize/Characters/CharacterMapping, and mark Use character names off.
c) as in (b), but check Use character names.
In each of these cases, pressing S will write the tree-diagram (in ASCII characters) to the text buffer/output file. Pressing H will give you more choices, and pressing ESC will take you out of the pre-view screen.

Commands:
b) cnames-; map ;
c) cnames=; map ;

2 Follow the steps of 1a (selecting, by double-clicking on it, the character male_spur), then press F2 or F1 to make branches thicker or thinner, and M to create the metafile (press H for help on more options).

3 You can use Data/ListCharacterNames to see how states are named (under Windows), or cname 78; or cname male_spur; (with commands). For character 78, the states are: theraphosoid, diplura, acanthogonatus, chaco, abs, proventral. Obviously, all but abs are some kind of presence. Thus, losses are changes from any state to abs, and gains from abs to any other state. To see the losses under Windows, select Optimize/CountSpecificChanges. Then type the name of every state but abs in the left box (alternatively, you can type ?, which means every possible state), and type abs in the right box. Then hit enter (the three different ways to display results may be more convenient depending on what you want to see). Gains are done in the opposite way (abs in the left box, ? in the right). To do this with command, you just type change / male_spur/ ? abs; for losses, and change / male_spur/ abs ?; for gains.

4 The exact number of reconstructions depends on the tree, and this in turn depends on the random seed you use. The reconstructions can be found with Optimize/Characters/Reconstructions (make sure that Format/PreViewTrees is off, so that all tree-diagrams go straight to the text buffer), or with recons. What can be said is that the character is uninformative: the minimum possible number of steps for this character

on any tree is 9 (e.g. 0123456789), but that is also the maximum: no tree requires more than 9 steps. Any tree you generate will have the same number of steps, achievable through many different paths, e.g.:
[tree diagram: all internal nodes reconstructed as state 0, with each of the terminals b-j acquiring its state (1-9) on its own terminal branch]

which implies 9 steps (from 0 to each state independently), or :


[tree diagram: the same tree, with the internal nodes reconstructed as 0, 1, 2, ..., 8, so that each state is derived from the preceding one along the tree]

5 The synapomorphies common to all trees are found with Optimize/Synapomorphies/MapCommonSynapomorphies, or with apo[;. They are in characters 7, 30, 31, 40 and 70. Characters 22, 45, 46, 64, 85, and 102 appear as synapomorphies if the strict consensus itself is mapped (saving the strict consensus to RAM, under Trees/Consensus, or with nelsen*;, and then optimizing it; note that the length of the consensus itself is 481 instead of 382 steps!), but they appear as synapomorphies in only some of the individual trees (and not all of them). Here is char. 102 in one of the trees:
[tree diagram: character 102 mapped on one of the shortest trees; state 0 in Flamencopsis and the Chilelopsis species, state 1 in the Lycinus species, and unknown (?) in D_bonariensis and D_ornatus, where it is optimized ambiguously as 0 or 1]

Since the character is not known in D_ornatus and D_bonariensis, and Lycinus has a state different from Chilelopsis and Flamencopsis, there are two possibilities: either 102 is a synapomorphy of Lycinus+Diplothelopsis, or character 102 is a synapomorphy of Lycinus only. There is no way to know, given this tree and the observations. Character 102 is optimized as a synapomorphy of Lycinus+Diplothelopsis without ambiguity in the strict consensus, but the conclusion is not warranted if we look at the implications of the individual trees.

Moral of the story: DO NOT OPTIMIZE CONSENSUS TREES!! Optimize the individual most parsimonious trees instead...

6 More of the same... Here is char. 22 on the strict consensus (with the node of interest marked with an arrow; the character requires 14 steps on this tree):
[tree diagram: character 22 mapped on the strict consensus; state 1 in Flamencopsis, Chil_553, L_frayjorg, and L_caldera, state 2 in Chil_puert and Chil_calde, and state 0 in the Diplothelopsis species and the remaining Lycinus species]

Here is char. 22 in one of the possible resolutions of the polytomy:


[tree diagram: character 22 mapped on one resolution of the polytomy, in which L_frayjorg is the sister group of all other Lycinus, so that the change from 1 to 0 occurs within Lycinus]

In this resolution (which, recall, has a total length of only 382 steps, with 13 steps instead of 14 for character 22!), Lycinus_frayjorge is the sister group of all other Lycinus, so that the change from 1 to 0 is a synapomorphy of all Lycinus other than L. frayjorge. Moral of the story: ditto for number 5!

7 Windows: Trees/RandomTrees, Trees/TreeBuffer/SortTrees, Optimize/TreeLength.

Non-windows: rand 10; sort; length ;
The trees are much longer when compared to a shortest tree (well over 1000 steps, compared to 382). Moral of the story: you are unlikely to find a shortest tree by generating trees at random!

8 The definition of tree-groups from the menus should present no special problems. Using commands, the definitions would be:
1. tgroup =0 (shortest) len=382;
2. tgroup =1 (medium) len=384;
3. tgroup =3 (longer) len=385;
4. tgroup =4 (random) len>385;
5. tgroup =5 (notsobad) [1] [2];
Note that the last one includes in group 5 the trees that are in either group 1 or group 2 (i.e. if you interpret the question as including the trees from group 1 and the trees from group 2). If you want to include the trees that simultaneously belong to group 0 AND group 1 (none, in this case, which is an alternative interpretation of the question posed), you have to use instead:
5b. tgroup =5 (notsobad) [ 1 2 ] ;
Moral of the story: if they are not ripe, don't bother catching the grapes, the fox said.

9 - This should present no problems.


Tree Searches (Solutions)


1 As you proceed to add more taxa, the exact search becomes orders of magnitude slower, but the heuristic search continues being very fast. In every one of the cases, the heuristic solution finds the same length independently in the vast majority of the replications (i.e. starting points), thus increasing confidence that the heuristic solution is indeed effective at finding shortest trees. For 27-29 taxa, here are the times for exact and heuristic searches:
27 active, 57 inactive taxa
Implicit enumeration, 40 trees found, score 84. 20.39 secs.
Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
 100   TBR     50 of 50  ------  84           0:00:00   1,281,770
Best score hit 99 times out of 100 (some replications overflowed).
Best score (TBR): 84. 40 trees retained. 0.09 secs.

28 active, 56 inactive taxa
Implicit enumeration, 24 trees found, score 86. 29.77 secs.
Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
 100   SPR     24 of 24  ------  86           0:00:00   1,138,249
Best score hit 100 times out of 100 (some replications overflowed).
Best score (TBR): 86. 24 trees retained. 0.08 secs.

29 active, 55 inactive taxa
Implicit enumeration, 60 trees found, score 109. 221.85 secs.
Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
 100   SPR     60 of 60  ------  109          0:00:00   2,515,332
Best score hit 100 times out of 100 (some replications overflowed).
Best score (TBR): 109. 60 trees retained. 0.11 secs.

In each of the cases, the heuristic search produced exactly the same trees as the exact search (even without global branch-swapping after the multiple random addition sequences); in the case of 29 taxa, 2000 times faster. For each of the taxon subsets, the same length was found independently in 99-100% of the replications, showing that the TBR algorithm easily finds a tree of minimum length for the data set, regardless of the starting point. Moral of the story: learn to trust the results of heuristic searches, when the minimum length can be hit repeatedly.

2 Demonstrated in class.

3 The command mult (or Analyze/TraditionalSearch with defaults) produces:

Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
  10   TBR     42 of 42  ------  227          0:00:00   1,932,043
Completed 10 random addition sequences.
Total rearrangements examined: 1,932,043.
Best score hit 7 times out of 10.
Best score (TBR): 227. 12 trees retained.

Thus, the length of the shortest trees is 227 steps; minimum length was found 7 out of 10 times, so it is unlikely that there are shorter trees (if we wanted to be more certain, we could do another set of 10 replications, changing the random seed, but that is not necessary for this exercise). From the report produced by the program, we cannot know the number of islands; there could be a single island of 12 trees, or up to 7 different islands. To begin exploring this, we can select one of these 12 trees (e.g. by selecting tree 0 with Trees/TreeBuffer/SelectTrees, or with tchoose 0;) and start swapping from it (by choosing Trees from RAM as Starting trees under Analyze/TraditionalSearch, or with bbreak;). This produces as output:
Start swapping from 1 trees (score 227)...
Repl.  Algor.  Tree    Score   Best Score   Time      Rearrangs.
 ---   TBR     5 of 6  ------  227          0:00:00   148,974
Completed TBR branch-swapping.
Total rearrangements examined: 148,974.
Best score (TBR): 227. 6 trees found.

Since these 6 trees were found from a single starting point (tree 0), it follows that they all belong to the same island. What about the other 6 trees? If you do a search with different random seeds (recall that seed 0 means using the time), using a single random addition sequence and making sure you don't throw away the previous trees (i.e. making sure that Replace existing trees is not checked in the dialog for Analyze/TraditionalSearch), you sometimes find no new trees (beyond these 6), which means that you have landed on the same island: the search produced one of these 6 trees. If you repeat searches, you eventually find 6 new trees:
Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
   1   TBR     11 of 12  ------  227          0:00:00   288,370
Completed 1 random addition sequences.
Total rearrangements examined: 288,370.
Best score hit 1 times out of 1.
Best score (TBR): 227. 12 trees retained.

The new trees (coming from a single starting point: you did just one addition sequence) must belong to a single island; so we know that the first set of 6 trees is in an island, the second set in another, and that's the 12 trees there are. So, there are TWO islands of 6 trees each, for this data set. Note that, as you repeat searches with a different random seed, sometimes the single random addition sequence may fail to find trees of minimum length, so that you end up with more than 12 trees:
Repl.  Algor.  Tree      Score   Best Score   Time      Rearrangs.
   1   TBR     15 of 16  ------  228          0:00:00   346,226
Completed 1 random addition sequences.
Total rearrangements examined: 346,226.
Best score hit 1 times out of 1 (some replications overflowed).
Best score (TBR): 228. 16 trees retained.

In this case, you have to get rid of the longer trees, retaining only the short ones (e.g. tchoose 0.5;), and continue trying new random addition sequences plus branch-swapping.

Constrained searches.- The constrained searches are done by first defining the constraints (force command, or Data/DefineConstraints), and then enforcing them (with constrain=, or checking the Enforce constraints option of the search dialogs). The difference in tree length when constraining the three groups at the same time arises because the best tree(s) where Ummidia+Calathotar+Heteromigas+Actinopus+Plesiolena+Idiops+Neocteniza+Misbolas do not form a monophyletic group may nonetheless have Stenoteromm+Acanthogona forming a monophyletic group, and vice versa; in this way, the length differences needed to make the three groups non-monophyletic may well be additive (although they don't need to be!). Thus, if you want to test the hypothesis of monophyly of specific groups, you have to constrain them one by one.
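As a minimal sketch of that procedure (using only the force and constrain commands named above; the exact way of listing the group should be confirmed with help force), enforcing monophyly of MECICOBOT+ATYPIDA in fam.tnt could look like this:

force + ( MECICOBOT ATYPIDA ) ;   (define a positive constraint: the group must be monophyletic)
constrain = ;                     (enforce the defined constraints in subsequent searches)
mult ;                            (search; the best constrained trees should have 235 steps, as in the exercise)
length ;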

4 Under that collapsing rule, the data set produces 23,328 distinct trees, in a single island. The best strategy would be making sure you have hit the minimum length a few times, then doing global branch swapping (with Analyze/TraditionalSearch, selecting trees from RAM as Starting trees, or bbreak;). However, the consensus will be identical if instead of finding those large numbers of trees, the minimum length (=382) is hit a fair number of times and the resulting trees are consensed. In addition, collapsing under rule 1 (=more strictly), the 72 distinct trees produce the same consensus. So, the best strategy is ignoring the premise for this exercise...

5 - It is not always true that as one saves more trees per replication, one is more likely to find all equally parsimonious trees; there are some exceptions to this rule. In the case shown in the exercise, there is a total of 864 distinct trees; they all belong to a single island. If you run 20 replicates saving up to 216 trees/replicate, the first replicate will probably find minimum length, and 216 trees; then, 864 - 216 = 648 trees, or 75% of the trees, remain to be found. The next replicate which finds a tree of minimum length is more likely to land on one of the unfound trees than on one of the trees already found, so that another 216 trees are found in the next few replicates. Now, we have 216 x 2 = 432 trees, and 50% of the trees remain to be found. Again, landing on one of the unfound trees is quite likely, producing 432 + 216 = 648 trees, so that exactly 216 (or 25%) of the trees remain to be found; at this point, we will probably be at the 6th or 8th replicate. One of the remaining 12 to 14 replicates to be done will almost certainly land in that 1/4 of the tree space. The second case, saving up to 430 trees/replicate, will find 430 trees in the first hit to minimum length, and another 430, for a total of 860, in the next few replicates. Now, only 4 trees, or 4/864 = 0.46%, remain to be found. It is rather unlikely that one of the remaining replicates will land on one of those 4 trees. Those 4 trees, therefore, are never found in the initial set of replications. Note that if global TBR is performed after the random addition sequences, the 864 trees are always easily found. The moral of the story is that you should not think of the multiple random addition sequences as a method by which to find all equally parsimonious trees; it is instead a method by which to produce hits to minimum length which are independent (and thus likely to belong to different islands). If you want many equally parsimonious trees, you then submit the trees resulting from the multiple random addition sequences to global TBR. This particular behaviour is in fact caused by the design of TNT: when TNT is doing multiple

random addition sequences, if it finds a tree which had been found (and presumably swapped) before, the replication is abandoned. Of course, the tree could be retained and swapped, thus producing additional trees which (because of the tree-buffer size) had not been saved by swapping on that tree before. However, swapping on that tree again can never produce a shorter tree (it didn't before!), so time is better spent moving on to the next replication than staying in this one just in case the previously found tree produces some trees that were not stored before.

6 The matrix has 8 taxa and produces 10,395 equally parsimonious trees; the number of possible trees for 8 taxa is precisely 10,395. Which is to say, each one of the (binary) trees for this matrix has exactly the same length: the matrix has so much character conflict, so evenly distributed, that no tree is better than any other tree. Generating a random tree is enough to find a tree of minimum length for this data set. In such a case, the exact solution will travel the paths leading to each of the 10,395 trees in an orderly fashion, producing a total of 10,395 complete trees. In other words, no tree is looked at twice. The heuristic solution, instead, will swap on the first tree, doing 120 to 170 rearrangements. But the same number of rearrangements will be produced by swapping on each of the trees, so that each tree is actually being found (on average) 165 times, with the consequence that, to complete swapping, the global TBR will have to examine about 1.7 million trees, instead of only 10,395. At the end of the swapping process, when many trees have been found already (and only a few trees remain to be found), the vast majority of the rearrangements attempted are redundant, producing one of the trees that had been found already; each of these rearrangements has to be compared to each of the (thousands of) pre-existing trees, only to finally realize that it was identical to one of them and discard it. This hardly means that heuristic searches will usually take longer than exact searches; it only serves to illustrate the different mechanics of the two methods.

7 There are two reasons for that. The main one is that when saving all possible trees, for every rearrangement done, the resulting tree has to be compared to the existing trees; this implies that the branches unsupported by synapomorphies have to be identified (which requires a complete optimization of the tree), and the tree compared to the pre-existing ones, in the vast majority of the cases only to be discarded. The tree-collapsing and comparing takes time, and it occurs for the full time of the search. When only 1000 trees are being saved, as soon as the memory buffer is filled (which happens after swapping the first 15-20 trees), this extra work is no longer required; the only work needed is to calculate tree-lengths, not to collapse and compare trees, and this is true for the majority of the trees to be swapped during the search. The second reason is that, after the tree-buffer has been filled (which never happens when saving all trees, and happens quickly when saving only 1000 trees), a rearrangement can be discarded as soon as the program realizes it is as long as the best tree found so far, so that tree-length calculations can be given up a little faster (if additional trees are to be saved and TNT, having looked at half the characters, already knows that the rearrangement will produce a tree of the same number of steps, it still has to look at the other half, because trees which are equally good must also be saved).

8 - Here is a graph of the strategies A, B, and C, showing the frequency with which different lengths (or steps beyond minimum) are found for each of the strategies (for 1000 cycles of each). For strategy A, it is almost certain that a 3-minute search will produce trees that are no more than 7 steps away from

minimum length. For strategy B, instead, the results are much more dispersed, and only about half the searches will produce trees up to 7 steps away from the minimum. Strategy C (without collapsing the trees) is even slightly inferior to strategy B (with frequencies shifted to longer trees), because it means that TNT swaps trees that are even more similar to each other than the trees found with strategy B (and thus less likely to lead to better trees).

[graph: frequency with which trees of different numbers of steps beyond the minimum are found, for strategies A, B, and C]

9 - Here is a log (every half a minute) of the best lengths (Best score column) for the first 20 minutes of multiple random addition sequences saving up to 5 trees per replication:
Repl.  Algor.  Tree         Score   Best Score   Time       Rearrangs.
   6   TBR      25 of  30   23052   23033        0:00:30     1,600,514,817
   9   TBR      40 of  45   23049   23033        0:01:00     3,295,329,996
  14   TBR      65 of  70   23038   23033        0:01:30     4,850,993,492
  18   TBR      85 of  90   23059   23033        0:02:00     6,662,048,923
  20   TBR      96 of 100   23043   23033        0:02:30     8,372,833,054
  23   TBR     111 of 115   23051   23033        0:03:00    10,027,735,468
  28   TBR     135 of 140   23036   23033        0:03:30    11,620,013,259
  30   TBR     146 of 150   23034   23029        0:04:00    13,346,557,747
  32   TBR     158 of 160   23036   23029        0:04:30    15,090,293,516
  36   TBR     175 of 179   23059   23029        0:05:00    16,626,658,318
  39   TBR     193 of 195   23051   23029        0:05:30    18,267,963,498

  44   TBR     215 of 220   23048   23029        0:06:00    19,827,046,318
  46   TBR     225 of 230   23051   23029        0:06:30    21,415,859,960
  50   TBR     247 of 250   23044   23026        0:07:00    23,032,628,560
  54   TBR     266 of 270   23034   23026        0:07:30    24,771,414,265
  59   TBR     290 of 295   23039   23026        0:08:00    26,363,892,662
  63   TBR     313 of 315   23029   23026        0:08:30    28,132,977,702
  65   TBR     324 of 325   23035   23026        0:09:00    29,867,669,390
  71   SPR     350 of 350   23058   23026        0:09:30    31,504,029,844
  76   TBR     375 of 380   23037   23026        0:10:00    33,312,063,848
  80   TBR     398 of 400   23025   23025        0:10:30    34,932,875,864
  84   TBR     417 of 420   23059   23025        0:11:00    36,621,536,621
  89   TBR     440 of 445   23047   23025        0:11:30    38,211,254,146
  94   TBR     465 of 470   23046   23025        0:12:00    39,738,504,629
  98   TBR     485 of 490   23041   23025        0:12:30    41,499,373,284
 101   TBR     501 of 505   23042   23025        0:13:00    43,069,505,875
 105   TBR     523 of 525   23066   23025        0:13:30    44,699,873,000
 110   TBR     545 of 550   23042   23025        0:14:00    46,267,437,176
 113   TBR     560 of 564   23027   23025        0:14:30    48,216,956,111
 116   TBR     576 of 580   23037   23024        0:15:00    50,015,661,126
 121   TBR     600 of 602   23043   23024        0:15:30    51,674,461,305
 125   TBR     624 of 625   23043   23024        0:16:00    53,449,424,685
 129   TBR     640 of 645   23036   23024        0:16:30    55,116,844,830
 134   TBR     665 of 670   23051   23024        0:17:00    56,911,973,007
 138   TBR     686 of 690   23054   23024        0:17:30    58,609,144,608
 143   TBR     711 of 715   23043   23024        0:18:00    60,302,726,240
 146   TBR     727 of 730   23039   23024        0:18:30    61,898,427,378
 151   SPR     750 of 750   23083   23024        0:19:00    63,490,875,144
 154   TBR     765 of 770   23053   23024        0:19:30    65,170,149,625
 157   TBR     780 of 785   23038   23024        0:20:00    67,035,307,031

This type of log can be produced by selecting Settings/ReportLevels/ReportProgress and changing Report status every ... seconds (alternatively, with the command report+30;). The log shows that the best length obtained (23024) after 20 minutes of searching is still well above the minimum possible for this data set (23005), despite having examined 67 billion rearrangements.


Consensus and ambiguity (Solutions)

1 Under rule 1 there are 72 equally parsimonious trees, the consensus of which has 60 nodes. Under rule 3, there are 23,328 most parsimonious trees; trying to find them all is annoying, but if you do, you can verify that the strict consensus of the 23,328 trees is exactly the same as the strict consensus of the 72 trees found under rule 1. The consensus is also the same as the strict consensus of the millions of most parsimonious binary trees. This equivalence arises because a branch which has zero length under some most parsimonious reconstruction can be collapsed without increasing tree length; this also implies that each of the three alternative resolutions of the resulting trichotomy will be of minimum length. The consensus of those three binary trees will then be identical to the polytomous (collapsed) tree.

2 When you find 10 trees of minimum length that are distinct under rule 1, these trees are retained by TNT as binary trees: they are compared to make sure they would be different if the zero-length branches were collapsed, but the branches themselves are not collapsed. The zero-length branches are then temporarily collapsed during consensus calculation. The 10 trees found are, most likely, sufficient to produce the correct consensus, which has 60 nodes (this of course depends on the random seed you used). When you take those 10 trees and collapse them temporarily (for consensus calculation) using rule 3 instead of rule 1, you are using a less stringent criterion. Many branches that were collapsed under rule 1 survive (so to speak) rule 3. Thus, the strict consensus of those 10 trees collapsed under rule 3 is likely to contain more nodes than the strict consensus (probably 63-65 nodes). These extra nodes are not truly supported by the data; had we found all the trees distinct under rule 3 instead of only 10, those nodes would not have been present in the consensus. When you take the 10 trees and consense them as they are (i.e. binary), the same thing happens, except that even more nodes survive: you will probably get consensus trees with 69-71 nodes. Which result should be reported? All the nodes beyond the initial 60 are nodes that are not supported by the data, i.e. groups that can be absent from trees of minimum length; that is, nodes which we do not need to postulate in order to have the most explanatory phylogeny. The correct result is then a consensus of 60 nodes. The point of the example is to show that if only a few trees are to be saved (and many data sets which produce large numbers of trees may require that we save only a small portion of the possible equally parsimonious trees), then it is better to collapse those trees more strictly: we are less likely to erroneously conclude that some group is supported by the data if we collapse the trees more strictly. Keep in mind that the 10 original trees were distinct under rule 1; if the criterion to consider each of the 10 as distinct had been rule 3 instead, the situation could be worse, because the trees would be even more alike. The last part of the exercise shows that, when collapsing with TBR, even finding a single tree of minimum length is sufficient to produce the correct consensus, with only 60 nodes.
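The arithmetic behind these node counts is just set intersection: each tree contributes the set of groups (clades) it displays after collapsing, and the strict consensus keeps only the groups present in every tree, so the looser the collapsing, the more groups survive in every set and the more nodes the consensus retains. A toy sketch of that logic (illustrative Python only; the clades here are invented, not taken from the exercise data set):

    from functools import reduce

    def strict_consensus(trees):
        """Each tree is a set of clades (frozensets of taxa); return the shared clades."""
        return reduce(lambda a, b: a & b, trees)

    # Two 'trees' over taxa A-E; the clade {D, E} is ambiguous in the second tree.
    t1 = {frozenset("DE"), frozenset("CDE"), frozenset("BCDE")}
    t2_binary    = {frozenset("DE"), frozenset("CDE"), frozenset("BCDE")}
    t2_collapsed = {frozenset("CDE"), frozenset("BCDE")}   # {D,E} collapsed away

    print(len(strict_consensus([t1, t2_binary])))     # 3 nodes survive
    print(len(strict_consensus([t1, t2_collapsed])))  # 2 nodes: stricter collapsing

This is only the consensus step; deciding which branches to collapse (rule 1, rule 3, or TBR collapsing) is the part TNT does during optimization.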

3 - The results we get for node 173 are:


Node 173 (12-tomy), 5 node(s) gained pruning Plmagnus (Node 179) ?? Plmagnus Colmulti (Node 189) (Node 180) (Node 177) Collafer (Node 192) (Node 191) Plkaindi (Node 172) Colopaci

(the command to obtain this is simply prunnelsen [173]=2;, which processes node 173 cutting up to two branches connected to the node in the consensus; the same can be obtained from the menus). And the results we get for node 169 are:
4 node(s) gained pruning 195, Mafrigid Disulcip ?? (Node 195) ?? Mafrigid (Node 257) (Node 211) NsKaumon (Node 256) (Node 226) NsDimoss Dierythr Diaterri NsDmunro Dimunroi

So, the consensus must be calculated excluding Plmagnus, Mafrigid, and the taxa that belong to node 195 in the consensus. You can look at them one by one, or you can just create a taxon group (for which we choose the name floaters) with all these taxa:

nelsen*;
agroup =0 (floaters) Plmagnus Mafrigid @24 195;
keep 24;

Note that the first command will add the consensus as the 25th tree (tree number 24). That is why the last command (keep 24) retains only the first 24 trees, to make sure that the consensus itself is discarded after having been used as reference for creating the taxon group. Once you have created the group, you can calculate the consensus excluding the taxa in floaters (note the double slash):

nelsen / / { floaters } ;

This produces (for the corresponding parts of the tree):

Colxanth Placumin Colpecko Colmulti a Notangul Notexter Viviolac Colpiceu a Collafer Plaroysi Colnigra Plkaindi a Colopaci Colbrunn Collatus Coleremi Colcasta Colmonti Coltrunc Colpacif Collaetu Colhopki Colhabil Coleryth Colcyane Colbuxto Colbucha

and
Chcostat Chmoloka Chcorrus Disulcip NsDiangl c NsDimons Atelaaae Dicurtip NsAtrapo Atrashar Atrakoeb Atraperk NsKaumon NsDimoss Dierythr c Diaterri b NsDmunro bc Dimunroi c>c Dermican NsBropta Broptatu Anagonoi b Dilongip Difractu NsDfract

A letter followed by the "greater than" sign indicates that the corresponding node appears in some trees not on the branch itself, but instead as a polytomy. The legends at the bottom indicate the meaning of the letters a, b, c: a: Plmagnus (31); b: Mafrigid (158); c: node 195 of the consensus. The numbers of possible resolutions for nodes 173 and 196 of the consensus are, respectively, 1.37x10^10 (i.e. for a 12-chotomy) and 3.16x10^11 (i.e. for a 13-chotomy). The number of actually different resolutions (found with the resols command, or with Trees/Comparisons/ShowResolutions) is 7 for node 173 and 10 for node 196.
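Those counts follow from the standard formula for the number of binary resolutions of a polytomy with n immediate descendants, (2n - 3)!! = 1 x 3 x 5 x ... x (2n - 3), the same count as the number of rooted binary trees on n leaves. A quick check in Python (illustrative only):

    def resolutions(n):
        """Number of binary resolutions of an n-furcation: (2n-3)!!"""
        result = 1
        for k in range(3, 2 * n - 2, 2):
            result *= k
        return result

    print(resolutions(12))   # 13,749,310,575  ~ 1.37e10
    print(resolutions(13))   # 316,234,143,225 ~ 3.16e11

The number of resolutions actually present among the optimal trees (7 and 10, as reported by resols) is of course vastly smaller than these upper bounds.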

4 - There are 12 groups present in tree 0 that are not present in tree 1. There are 10 groups in tree 1 that are not present in tree 0. These two numbers don't need to be the same (they will be the same only when comparing completely resolved trees; these trees have some polytomies because we are using temporary collapsing under rule 1).
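These two counts are just the two set differences between the clade sets of the trees; with polytomous trees the sets can have different sizes, so the differences need not match. Continuing the toy representation used above (trees as sets of clades; illustrative Python, invented clades):

    t0 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
    t1 = {frozenset("AB"), frozenset("CDE")}          # one node fewer (a polytomy)

    only_in_t0 = t0 - t1    # groups of tree 0 absent from tree 1 -> 2 groups
    only_in_t1 = t1 - t0    # groups of tree 1 absent from tree 0 -> 1 group
    print(len(only_in_t0), len(only_in_t1))

Only when both trees are fully resolved do the two clade sets have the same size, and hence the two differences the same count.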

5 The agreement subtree includes 36 taxa; the heuristic algorithms of TNT find an agreement subtree with only 34 taxa.

6 - The agreement subtree has 25 taxa (there are actually 6 different subsets of 25 taxa resolved identically in all input trees). That is half the taxa, and it would indicate that the trees are rather different. The interpretation is not correct, however, because the trees differ only in the resolution of one polytomy; the strict consensus has 45 nodes. The SPR-distance between the two trees is only 2 moves. For these two trees, both the number of nodes in the strict consensus and the SPR-distances provide a better assessment of the similarity of the trees than the number of taxa in the agreement subtree.

7 The consensus is completely unresolved for option (1), and perfectly resolved for options (2) and (3). The difference is because, when a taxon has a missing entry in a character, this always creates ambiguity in the possible synapomorphies for the branches above and below the point where the taxon is inserted in the rest of the tree. Thus, if X has a missing entry for the character that would provide (in the tree below) a synapomorphy for the group CD:
[Tree diagram: (A (B (X (C D)))), with observed states A=0, B=0, X=?, C=1, D=1]

this creates ambiguity in the optimization of the character:
[Tree diagram: the same tree with states optimized; the node joining X to (C D) is ambiguous (0/1), the CD node is 1, and the nodes below are 0]

This means that condensing the tree under rule 1 (which eliminates ambiguously supported groups) would produce:
[Tree diagram: B, X, C, and D collapsed into a polytomy]

because both CD and XCD are ambiguous groups. Subsequently removing X from this tree still leaves B, C, and D in a polytomy! This is what happens under option (1). When collapsing trees temporarily, this does not happen, because TNT takes care of the situation by noting that the node below X (which is to disappear from the tree, together with X itself) is ambiguous, but the nodes above and below it are not, thus making the group above X unambiguously supported. So, this is equivalent to creating 3 binary trees for each ambiguously supported branch (including X), then removing X, and consensing.
[Six tree diagrams: the binary trees obtained by resolving the ambiguously supported branches in all possible ways, with X placed at each of its alternative positions among B, C, and D]
When X is removed from those 6 trees (two of which are identical), we obtain:

[Tree diagram: (A (B (C D)))]

which is the correct result. In the case of (3), since ambiguously supported branches are retained, condensing the tree:
[Tree diagram: the tree with states optimized; the node joining X to (C D) is ambiguous (0/1), the CD node is 1, and the nodes below are 0]

produces exactly the same tree. Temporary collapsing in that case can be done without the special precaution described for the case of rule 1.

8 The majority rule consensus tree of the 31 equally parsimonious trees (only 29 if zero-length branches are collapsed) is:
Root A B 100 C 93 D 87 E 80 F 74 G 67 X 61 H 54 I 54 J 61 K 67 L 74 M 80 N 87 P 93 O

X is in the middle of the tree. Note the symmetric way in which the frequencies for the groups change; that is because there are fewer positions of X (2/31, about 0.07) which make PO non-monophyletic than there are positions of X which make group NPO non-monophyletic (4/31, about 0.13). The same is true for the groups that exclude Root+A and Root+AB. In the majority rule consensus, X is placed in the middle of the tree, yet we know nothing about X (it has only missing entries); the conclusion regarding the position of X is certainly not a sensible one.
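These frequencies follow directly from counting attachment points. If, as the exercise implies, the 31 optimal trees correspond to the 31 branches on which the all-missing taxon X can be attached (2n - 3 branches for the n = 17 remaining taxa), then a nested group of m terminals is broken only when X attaches to one of the 2m - 2 branches inside it (its stem does not count), so its frequency is (31 - (2m - 2))/31. An illustrative check in Python:

    n_branches = 2 * 17 - 3              # 31 possible attachment points for X
    for m in range(2, 9):                # group sizes, from PO (m = 2) inward
        broken = 2 * m - 2               # attachments of X that break the group
        freq = 100 * (n_branches - broken) / n_branches
        print(m, int(freq))              # 93, 87, 80, 74, 67, 61, 54

which reproduces the 93, 87, 80, ..., 54 values shown above (truncated to whole numbers); the frequencies then rise again symmetrically on the other side of X.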

9 If you look carefully at trees 1-3, you will notice that these trees can be combined as:

Diplurines A_alegre A_incursa A_guttulat A_pissii A_huaquen A_quilocura A_franckii

because A_huaquen is absent from tree 3 but is closer to A_franckii than it is to A_pissii in tree 2, and A_quilocura is absent from tree 3 but closer to A_franckii in tree 1. Since there is no way to know which of A_quilocura, A_franckii, and A_huaquen are closer to each other, these three taxa form a polytomy. Combining tree 0 with the others produces a full polytomy. Note that tree 0:
Diplurines A_juncal A_huaquen A_campanae A_guttulat A_franckii A_alegre

displays A_alegre as closer to A_franckii than to A_guttulata; this contradicts each of the groups in the supertree shown above. The same is true of A_huaquen, which in tree 0 is outside of the group that includes A_franckii, A_guttulata, and A_alegre.

10 A total of 4 SPR moves are necessary to interconvert trees 0 and 1. Since there are 84 taxa, only 4 moves is a small number; thus, according to this criterion, the trees are very similar.


Character Weighting (Solutions)

1 The easiest way to find the consensus trees for concavities 5-8 is with a mini-script, typed at the command line:

tsa*alltrees.tre;
macro=;
loop 5 8
 piwe = #1; mu20; bb; ne*; save {strict};
stop

This will save to the file alltrees.tre the strict consensus for each concavity. Then, closing the file (and discarding the trees found before), it is possible to calculate the consensus of consensi (making sure that the consensi themselves are not temporarily collapsed):

tsav/; keep 0 ; coll notemp ; p alltrees.tre ; ne ;

With this, it can be verified that the strict consensus for all the concavities has 51 nodes, and is relatively well resolved. In particular, it has most of the genera (notably Acanthogonatus, a large genus indicated as A_...) as monophyletic. What this means, in words, is that the conclusion of monophyly of Acanthogonatus (and other genera) does not depend on the specific choice of K value (at least within the range 5 ≤ K ≤ 8).

2 - Under equal weights, for the full data set, there is a single tree, of length 592 (note that superficial searches may find longer trees; make sure you find the shortest; roughly 2-3% of the random addition sequences saving 10 trees find this length). On this tree, the perfect data have a length of 96 steps; for the perfect data alone, there are trees with 56 steps. That is, the data set including the random characters produces a tree which is very different from the tree produced by the structured data alone. When implied weighting is turned ON, under K=3 (the default), the best tree for the full data set has score 43.69068; every addition sequence finds this score (showing that, under implied weighting, the choice of tree is much more clear-cut). The score for the perfect data on that tree is 0.00 (no homoplasy) and, obviously, there are no better trees. In other words, adding the random data does not distort the tree preferred by the good data alone. When the weights instead increase with homoplasy, there is a single tree of score 549.75000; the score for the perfect data on that tree is 40.20000, but, for the perfect data alone, the best tree has a score of 0 (no homoplasy). This shows that the function that increases weights with homoplasy produces, for the full data, trees which are far from the optimal trees for the perfect data alone.
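For reference, the score reported under implied weighting is usually described as the sum, over characters, of h/(h+K), where h is the character's homoplasy (extra steps) on the tree and K is the concavity constant; a character free of homoplasy contributes 0, which is why the perfect data score 0.00 on the optimal tree. A small illustrative sketch (Python, assuming this form of the score):

    def implied_weighting_score(extra_steps, k=3.0):
        """Sum of h/(h+k) over characters; lower is better."""
        return sum(h / (h + k) for h in extra_steps)

    print(implied_weighting_score([0, 0, 0]))    # 0.0  -- no homoplasy at all
    print(implied_weighting_score([1, 2, 5]))    # 0.25 + 0.4 + 0.625 = 1.275

Under this function an extra step added to an already homoplastic character costs less than a first extra step in a clean character, which is what down-weights the random characters.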

3 To define a weighting function, you have to set implied weights ON before reading the data, then read the data and type at the command line the weights for the different numbers of extra steps: piwe[ 1 0 ;

this will give a weight of 1 to the first transformation in the character, and 0 to all further ones. This is a clique function. The optimal trees under this function have a score of 64.0000 (i.e. 64 characters have homoplasy; since there are 104 active characters, this means that the clique contains 40 characters that are perfectly compatible and can be free of homoplasy on the same tree). There are very many trees on which 40 characters are free of homoplasy, and their strict consensus is poorly resolved (with only 15 nodes). There are 4 additional groups that can be resolved by considering the secondary cliques (i.e. groups supported by characters with just one step of homoplasy); this can be done with a weighting function that gives a weight of 100 to no homoplasy, 1 to the first step, and 0 to all the rest: piwe [ 100 1 0 ; Under such a weighting function, the best trees have a score of 63.52000. Similarly, the groups supported by the tertiary clique could be found by adding a category: piwe [ 10000 100 1 0 ; which adds another 7 nodes to the consensus (for a total of 26 groups). This is still far from the results of a parsimony analysis, even with a sensitivity analysis like the one of exercise 1, which produced 51 nodes in common for 4 different concavities (K = 5 to 8). The results of compatibility analyses were, typically, very poorly resolved. In addition, it is hard to see a rationale for first ignoring many characters completely (those not in the primary clique), but then using them to resolve some portions of the tree.

4 - The trees produced by the different criteria are shown below, with the characters of interest mapped. In the solution for equal/implied weights, there are two characters (0 and 1) which put bisca and lutea together. Note that the character is mapped on the tree as a reversal (1 → 0) in bisca+lutea, and that there are many transformations 0 → 1 in the rest of the tree, but no other 1 → 0. This is in conflict with another character (2), which would instead prefer to have huaria and lutea together (thus saving one step). In the solution for auto-weighted optimization, characters 0 and 1 are best mapped as independent losses in the successive sister groups of bisca+huaria+lutea (since there are many 0 → 1 changes elsewhere in the tree). Thus, the two potential synapomorphies of bisca and lutea are not really synapomorphies; they are best viewed as plesiomorphic at this point. Since no steps would be saved for chars. 0 and 1 by joining lutea and bisca, and joining lutea and huaria instead saves a step (an unambiguous transformation 0 → 1 in char. 2), that tree is preferred.

[Four tree diagrams: characters 0 and 1, and character 2, mapped on the tree obtained under equal and implied weights, and on the tree obtained under auto-weighted optimization; the positions of !!M_HUARIA, !!M_BISCA and !!M_LUTEA differ between the two solutions]

5 The supwt.run script can be downloaded from the scripts subdirectory of the TNT web page. To run it with the specified options, you have to copy it to the same directory where you run TNT, read the data set into TNT, type macro=; at the command line, and then type supwt 15 6;. If you add + as a third argument ( supwt 15 6 +;), then instead of re-weighting with the number of supported branches at which the character changes, the script re-weights characters based simply on the homoplasy. The difference in results is because the support-weighting method does not decrease the weight of the characters as a function of the homoplasy. The difference in running times suggests that support weighting may have a harder time finding a stable solution.

6 To perform successive weighting, you can also use the rewt.run script, in the ../dsets/scripts subdirectory (and also in the package of scripts that comes with TNT). This script simply re-weights the characters, based on the trees in memory. So, you have to first make a search under equal weights (e.g. hold 1000; mult100=hold10;bb;). You then call the re-weighting script (i.e. you type rewt; at the command line), and it will report whether the weights were changed (in which case you have to search again, until the weights no longer change). Alternatively, you can take advantage of the fact that the re-weighting script writes its exit value to variable number 2 (when '2' equals 0, the weights did not change; otherwise, the weights changed). Then, after doing the first search, if you type at the command line:

macro=;
loop 1 15
 rewt;
 if ( !'2' ) endloop ; end ;
 mult 10=ho10;
stop

TNT will repeat the search (mult10=ho10;) as many times as necessary to stabilize the weights (up to 15 times, in the example).


Group Support (Solutions)

1 - For many nodes, option (a) will produce larger values of support. If you have found the most parsimonious trees for your data set, then the bremer support can only be overestimated, never underestimated: the bremer support of a group is obtained by subtracting the length of the most parsimonious trees from the length of the best trees without the group; if you have failed to find the best trees that lack the group, then you are overestimating support. Thus, in every case where there is a difference between methods (a) and (b), the larger value is necessarily incorrect. Option (a) overestimates group supports because searching in that way quickly fills the memory (only 10,000 trees) with very long trees; the criterion of acceptance is 15 steps beyond maximum parsimony, so the first 10,000 rearrangements attempted under TBR will be acceptable. Most of them will be very long trees, with the consequence that the trees which are only 1 or 2 steps longer than the best (and which are needed to identify poorly supported groups) are never found. These are found first with method (b), which first searches for trees 1 step longer, then trees 2 steps longer, etc. The situation would be different if we were to use method (a) but saved astronomical numbers of trees; eventually the trees only 1 or 2 steps longer than the most parsimonious ones would be found. Since the number of trees that would have to be saved is intractable, any practical study will be forced to use the approach in (b).
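The asymmetry described here is easy to see in the arithmetic: the support is the difference between the best length found among trees lacking the group and the overall best length, so any failure to find the truly shortest trees without the group can only push the value up. A trivial illustrative sketch (Python; the numbers are invented):

    def bremer_support(best_without_group, best_overall):
        """Bremer support = extra steps needed to lose the group."""
        return best_without_group - best_overall

    print(bremer_support(1005, 1000))   # 5  -- true value, best trees found
    print(bremer_support(1012, 1000))   # 12 -- search missed trees of 1005: inflated

This is why, whenever (a) and (b) disagree, the larger value is the wrong one.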

2 The measures are generally (if not exactly) correlated. The correlation between relative bremer supports and frequencies under symmetric resampling is R= 0.90. The plot shows the cases:

[Scatter plot: group frequencies under symmetric resampling vs. relative bremer supports]

The correlation between relative bremer supports and frequency differences under symmetric resampling is somewhat weaker (R = 0.864):

[Scatter plot: frequency differences (GC) under symmetric resampling vs. relative bremer supports]

3 To calculate the groups for a reference tree, you have to check on Use groups from tree (in the lower right corner of the dialog of Analyze/Resampling), or use the from option of the resampling command. For example, to calculate frequency differences (GC) under symmetric resampling for the groups in tree 0: resample replications 100 nofreq gc [ mult10=hold 5;] from 0 ; This produces the following values of group support (note that the values in square brackets are negative values i.e. groups indicated by the resampling as contradicted):

OPISTOTHEL ATYPIDAE 62 Aliatypus 96 Antr_Atypoi MECICOBOTH PARATROPIDI 100 Scotinoecu [54] Atrax [27] Hexathele 96 Porrhothel [20] Euagrus [15] Chilehexop Ischnothel 97 Diplura Fufius Bolostromus Rhytidicolus 77 Cyrtauchenius Misbolas [38] 95 57 Neocteniza Idiops 86 Myrmekiaphi Ummidia [55] Plesiolena 99 Actinopus [41] 29 Heteromigas 100 Calathotar BARYCHELIDA [39] Glabropelma [27] THERAPHOSIN Ischnocolus Xenonemesia [43] Ixamatus Mexico Ecuador [19] 39 Pseudonemesia 35 Microstigmata Micromygale [35] Acanthogona Stenoteromm [31] Nemesia Neodiplothe

The difference between most replicates and the tree from previous.ctf (used as reference) is (as pointed out) in the placement of PARATROPIDIDAE. Excluding this terminal from the analysis can be accomplished easily from the menus, or with: resample replications 100 nofreq gc [ mult10=hold 5;] from 0 / PARATROPIDI ; The results are then:


OPISTOTHEL ATYPIDAE 62 Aliatypus 96 Antr_Atypoi MECICOBOTH Scotinoecu 100 Atrax [27] Hexathele Porrhothel 96 [20] Euagrus [15] Chilehexop Ischnothel Diplura 97 Fufius Bolostromus Rhytidicolus 77 Cyrtauchenius Misbolas 68 95 57 Neocteniza Idiops 86 Myrmekiaphi Ummidia [55] Plesiolena 99 Actinopus 64 29 Heteromigas 100 Calathotar BARYCHELIDA 22 Glabropelma 83 THERAPHOSIN Ischnocolus Xenonemesia 16 Ixamatus Mexico Ecuador [50] 39 Pseudonemesia 35 Microstigmata Micromygale [66] Acanthogona Stenoteromm [62] Nemesia Neodiplothe

which shows that the general structure of the tree is well supported, but the mistake made by that previous proposal was only in the placement of PARATROPIDIDAE.

4 - This exercise illustrates alternative treatments of multiple trees in resampling. Option (a) produces a complete bush. Option (b) produces a well-resolved tree, with X placed in the middle (as in exercise 8 of the lab on Consensus). Variants of option (c) produce a progressively less resolved result. The correct result is the one produced by option (a), and by option (c) saving up to 100 trees. The artifact in (b) arises because, when a single tree is saved, there is (roughly) the same probability that a search will end in any one of the multiple optimal trees, but the majority of those trees have the smaller groups as monophyletic. Thus, the fact that some groups are more frequent among optimal or near-optimal trees distorts the measure of group support. This is solved by saving multiple trees (either explicitly, as in (c), or implicitly, as in (a)). In the case of PAUP*, the groups for each pseudoreplicate are weighted according to their frequency among the most parsimonious trees for the pseudoreplicate, with the consequence that the results are identical to those of option (b).

5 This exercise illustrates the effect of more or less exhaustive searches. In general, as searches become more aggressive (i.e. more likely to find the shortest trees), the different pseudoreplicates tend to share more groups. In other words, there are two sources of variation in the groups found during resampling with approximate searches: one results from the failure of the search algorithms to find a tree short enough to contain a group that should be there; the other results from the resampling itself. Using more aggressive searches diminishes the first of these effects; normally, this produces better values of support for the groups that are actually supported by the data. Since the five search options are arranged, roughly, in order of increasing exhaustiveness, the support values (which can be seen by typing ttag; after all searches finish) tend to increase.

6 - Given the previous exercise, it is desirable to use a good trade-off between time and accuracy. For this data set, running the last search routine in the previous exercise will take too long. Collapsing with TBR (to make sure we take into account ambiguity; cf. exercise 4) and searching with a single replication using some sectorial search and tree-drifting (to make sure we find short enough trees, cf. exercise 5; see next lab or on-line help of TNT for details on the commands) can be done with: resample rep 100 nofreq gc [ mult1 = hold 1; sec = xss; drift = iter5; ] ; On a fast machine (3GHz), this sequence of commands takes about 3 minutes to complete the resampling for Zilla.

7 The bremer.run macro should produce more accurate (lower) values in many cases.

8 - The values of partitioned bremer supports are:


[Tree diagram for taxa a-g, with partitioned bremer supports shown on three branches: +5.00, 0.00, -4.00; +5.00, 0.00, -4.00; and +5.00, 0.00, -3.00]


Large Data Sets (Solutions)

1 - This is a histogram of the frequencies (in 1000 repetitions) of the routines (a) and (b):

[Histogram: frequency vs. steps beyond minimum (1-20), for routine (a) (Column A) and routine (b) (Column B)]

Routines (a) and (b) differ only in the application of tree-fusing. Observe the dramatic difference between the two tree-length distributions, obtained with almost no additional work (the tree-fusing takes only a few additional seconds). The tree-length frequencies for routines (c) and (d) are:
[Histogram: frequency vs. steps beyond minimum (1-20), for routine (c) (Column C) and routine (d) (Column D)]

Note that, compared to routine (b), this is shifted even more strongly to the left. Tree-fusing (at the end of routine d) does not have such a dramatic effect, since routine (c) is already so close to the minimum length.

2 This is a larger and more difficult data set, with 854 taxa. The tree-length distributions for this data set (for routines a-d of the previous exercise, 1000 runs of each routine) are:
[Histogram: frequency vs. steps beyond minimum (1-35), for routines (a)-(d) (Columns A-D)]

In this case, the difference between routines (a) and (b) is even more dramatic, and there is also a more obvious difference between routines (b) and (c-d). As for the second part of this exercise: the perturbation phase, in tree-drifting, is every other iteration (recall that the alternate iterations accept only equally optimal trees). The commands:

report +/1; keep 0; mu1=ho1;
drift = iter 30 nums 250 fitd 3 rfitd N ;

will produce a log of the results of every iteration. The three columns show the lengths after every cycle (the first line is the score accepting only optimal trees; alternate lines, shown in bold in the original output, are the scores after the perturbation phases); note that there is much more dispersion in the tree lengths after the perturbation phase when accepting trees with a relative fit difference of 1.0, and a slight difference between accepting 0.05 and 0.15:

Iter.   RFD 0.05   RFD 0.15   RFD 1.0
  1       23038      23038      23038
  2       23021      23021      23034
  3       23019      23017      23025
  4       23017      23017      23030
  5       23017      23017      23020
  6       23017      23018      23025
  7       23015      23018      23020
  8       23015      23017      23024
  9       23015      23014      23013
 10       23015      23015      23021
 11       23015      23014      23014
 12       23015      23018      23019
 13       23015      23014      23013
 14       23015      23017      23024
 15       23015      23015      23017
 16       23015      23020      23022
 17       23015      23014      23016
 18       23015      23018      23014
 19       23015      23015      23013
 20       23015      23015      23018
 21       23015      23014      23015
 22       23015      23014      23023
 23       23015      23010      23015
 24       23015      23013      23019
 25       23015      23009      23009
 26       23015      23010      23014
 27       23015      23009      23009
 28       23015      23009      23013
 29       23015      23008      23011
 30       23015      23009      23026
Note that the best final score (23008) was found by accepting an RFD of 0.15. Accepting an RFD of 0.05 (best score 23015) produces a lower dispersion of tree lengths after the perturbation phase, but by the same token may make it more difficult to get over the barriers between islands of trees. Accepting an RFD of 1.0 (best final score 23011) produces trees that are too long, making it more difficult to come back down to a good length, and (since it accepts rearrangements regardless of whether any characters support the alternative topology) does not properly take character conflict into account.

3 The settings specified produce (on fast machines) a hit to minimum length every 2 minutes, for this data set. Some hits to minimum length simply find that length with one of the initial builds (either when doing sectorial search, or tree-drifting):
Repl.  Algor.  Tree  Score   Best Score  Time     Rearrangs.
...
  2    DRIFT    1    23011   23011       0:03:16  10,813,189,236
  2    DRIFT    2    23006   23006       0:03:27  11,419,128,752
  2    DRIFT    3    23005   23005       0:03:39  12,179,322,641
  2    DRIFT    4    23010   23005       0:03:51  12,816,613,441
  2    DRIFT    5    23008   23005       0:04:03  13,488,271,478
  2    FUSE     0    23005   23005       0:04:03  13,488,271,653
  2    FUSE     1    23005   23005       0:04:04  13,524,729,521
  2    FUSE     2    23005   23005       0:04:06  13,620,301,367
  2    FUSE     0    23005   23005       0:04:06  13,657,087,127
  2    FUSE     1    23008   23005       0:04:07  13,692,732,003
  2    FUSE     2    23005   23005       0:04:07  13,729,213,427
  2    FUSE     5    23005   23005       0:04:07  13,729,213,427

The columns indicate (respectively) the number of the hit to minimum length (here, always the second), the final algorithm used in each build, the number of the build (RAS), the final score for that build, and the best score for that hit among all builds (since this is the third hit to minimum length, the first build begins at 3'16''). Note that the third build finds 23005 (TNT could move on to the next hit to minimum length instead of continuing with new builds, but we have not set the options to do so in this example). Other hits to minimum length only produce trees of 23005 steps after fusing the trees resulting from the builds, sometimes only after repeated additional builds:
Repl.  Algor.  Tree  Score   Best Score  Time     Rearrangs.
  1    DRIFT    1    23007   23007       0:01:45   5,606,228,391
  1    DRIFT    2    23006   23006       0:01:58   6,337,493,632
  1    DRIFT    3    23007   23006       0:02:09   7,025,632,985
  1    DRIFT    4    23006   23006       0:02:21   7,705,510,748
  1    DRIFT    5    23007   23006       0:02:33   8,430,135,700
  1    FUSE     0    23006   23006       0:02:33   8,430,135,881
  1    FUSE     1    23006   23006       0:02:34   8,468,280,240
  1    FUSE     2    23006   23006       0:02:35   8,504,673,027
  1    DRIFT    6    23006   23006       0:02:46   9,230,414,277
  1    DRIFT    7    23006   23006       0:02:58   9,902,274,187
  1    FUSE     0    23006   23006       0:02:58   9,902,274,544
  1    FUSE     1    23005   23005       0:02:59   9,938,755,119
  1    FUSE     2    23005   23005       0:02:59   9,975,214,865
  1    FUSE     0    23005   23005       0:03:00  10,011,944,796

In this case, the first five builds have lengths 23007 and 23006; tree-fusing those 5 trees fails to find any shorter trees. Two additional builds (numbers 6 and 7) are then done (between 0:02:35 and 0:02:58), and the resulting trees (of 23006 steps) are added to the pool of trees to be fused. The second attempt at fusing this enlarged set of trees now produces a tree of 23005 steps.

4, 5 - Nothing to explain.

6 - Running TNT on a 3.0 GHz machine, the ratchet results after 5 minutes are:
Repl.  Algor.  Tree      Score   Best Score  Time     Rearrangs.
 98    RAT     0 of 12   61502   61502       0:05:00  83,372,536,355

which is to say, in 5 minutes TNT completed 98 cycles of ratchet, and found trees of 61502 steps (after looking at 83 billion rearrangements). PAUP* took about 12 min. to complete the initial RAS+TBR (producing a tree 50 steps longer than the final TNT ratchet tree), and after an additional 30 min. it had not completed the first perturbation cycle of the ratchet. I got bored and ran out of patience. PAUP* would need well over 5 hs. to complete the same number of ratchet cycles as TNT [note: this is the Windows version of PAUP*, running under Wine; since top showed PAUP* using 70% of the CPU time, I have multiplied the actual times by 0.70].

7 Without constraints, the results are:


Repl.  Algor.  Tree  Score   Best Score  Time     Rearrangs.
  6    FUSE    6     61475   61475       0:04:43  66,543,027,041

With constraints (1537 nodes, consensus estimated using 15 RAS+TBR and collapsing trees with TBR per replicate):
Repl.  Algor.  Tree  Score   Best Score  Time     Rearrangs.
  6    FUSE    6     61470   61470       0:01:33  61,402,178,934

In other words, the search with constraints proceeded about 3 times faster than the search without.

8 - The following table shows the results of running different values of precision and accuracy for part (a). Rows correspond to the precision setting (finding true nodes) and columns to the accuracy setting (not finding false nodes):

Precision    Accuracy 1    Accuracy 2    Accuracy 3    Accuracy 4    Accuracy 5
    1        224 / 0.0134  373 / 0.0214  372 / 0.0134  353 / 0.0113  320 / 0.0000
    2        334 / 0.0299  382 / 0.0366  381 / 0.0131  381 / 0.0236  321 / 0.0000
    3        336 / 0.0327  397 / 0.0428  401 / 0.0150  374 / 0.0160  324 / 0.0000
    4        353 / 0.0312  381 / 0.0367  373 / 0.0188  370 / 0.0054  342 / 0.0000
    5        387 / 0.0233  376 / 0.0160  351 / 0.0142  356 / 0.0084  329 / 0.0030

The first number in each cell is the number of nodes in the estimated consensus (precision); the second number is the proportion of incorrect nodes (i.e. number of false nodes / number of nodes in the estimated consensus, or accuracy). The precision setting determines how each of the 15 searches is done:

1  single RAS, no swapping
2  single RAS+SPR
3  single RAS+TBR
4  3 RAS+TBR, keeping the best of the 3 trees
5  single RAS+TBR plus 15 iterations of ratchet

The accuracy setting determines how the trees are collapsed:

1  no collapsing
2  collapsing trees with rule 1 (min. length = 0)
3  SPR collapsing
4  TBR collapsing
5  not implemented directly from the menus; it uses the same collapsing as (4), but calculates the strict consensus of the 15 replicates (instead of using the strict consensus of the best 25% of the trees and the 85% majority rule of all the trees, as in the default)

As can be seen from the table, the stricter the collapsing, the fewer spurious nodes are found. Note also that using more aggressive searches does not, by itself, decrease the proportion of spurious nodes; only stricter collapsing diminishes the number of spurious nodes. Part (b): when finding a tree of minimum length, all of the truly supported groups will be present, together with many unsupported ones. The actual number of unsupported nodes will be about 95 (if collapsing the trees with rule 1), a few more if collapsing with rule 3, and about 50 if collapsing with TBR (be careful not to temporarily condense the consensus from zillacon.ctf, either by collapsing zero-length branches or with TBR collapsing, which requires a prior dichotomization of the tree!).

9 - The requested parameters may be set from Analyze/TraditionalSearch, or more easily with the following commands:

sec: xss10-7+1-1 fuse 2 ;
hold 1000 ;
xmult = consense 2 conbase 5 confact 75 giveup 16218 norss xss repl 3;

The results of this are:
Repl.  Algor.  Tree  Score   Best Score  Time     Rearrangs.
 26    FUSE    5     ------  16218       0:01:17  7,426,265,957

Completed search. Total rearrangements examined: 7,426,265,957.
No target score defined. Best score hit 27 times. Best score: 16218. 85 trees retained.
Consensus (399 nodes, stabilized 2 times) saved as tree 84
Number of nodes in each stabilization: 408, 399

As can be seen from the report, the first stabilization did find some spurious nodes (producing a tree of 408 nodes, instead of the 399 in the right consensus). The second stabilization, however, produced the right tree (and the strict consensus of the two stabilizations is identical to the second stabilized tree and identical to the tree in ../dsets/zillacon.ctf).
