Python for Bioinformatics: Phylogenetic Trees: rooting

Wednesday, July 22, 2009

What does it mean to "root" a phylogenetic tree? Trees built with simple methods like neighbor-joining (NJ) are unrooted. The tree's horizontal structure reflects "distances" between the sequences being compared, but the choice of which internal node lies at the extreme left (the root) is usually arbitrary. The standard way to root a tree produced by this method is to include an outgroup in the set being analyzed: a sequence that is "known" to be more distantly related to all the other sequences than they are to one another. We then know that the ancestral sequence lies somewhere on the evolutionary path between the outgroup and the rest of the sequences.

The tree I showed last time was made by Clustal using the NJ algorithm. In Newick format it looks like this:

(
(
Stenotrophomonas_maltophilia:0.07574,
Kingella_oralis:0.08026)
:0.00827,
Pseudomonas_aeruginosa:0.05950,
(
(
Salmonella_typhi:0.01297,
Escherichia_coli:0.01491)
:0.03356,
Haemophilus_parainfluenzae:0.06113)
:0.03863);

The newlines in this are not relevant (although the formatting helps us to see the structure), so this representation is equally valid:

((Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026):0.00827,Pseudomonas_aeruginosa:0.05950,((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,Haemophilus_parainfluenzae:0.06113):0.03863);
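
As a quick check of that claim, here is a minimal Python sketch that strips the whitespace from the formatted version and compares it with the one-line version. The helper name strip_whitespace is my own, and the approach assumes (as is true here) that no label contains quoted spaces.

```python
# Multi-line Newick text, copied from the formatted listing above.
pretty = """(
(
Stenotrophomonas_maltophilia:0.07574,
Kingella_oralis:0.08026)
:0.00827,
Pseudomonas_aeruginosa:0.05950,
(
(
Salmonella_typhi:0.01297,
Escherichia_coli:0.01491)
:0.03356,
Haemophilus_parainfluenzae:0.06113)
:0.03863);
"""

# The same tree written on one line.
one_line = (
    "((Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026):0.00827,"
    "Pseudomonas_aeruginosa:0.05950,"
    "((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,"
    "Haemophilus_parainfluenzae:0.06113):0.03863);"
)


def strip_whitespace(newick):
    """Remove all spaces, tabs and newlines from a Newick string."""
    return "".join(newick.split())


compact = strip_whitespace(pretty)
print(compact == one_line)  # True
```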

When I loaded the tree into R

library(ape)
setwd('Desktop')
tree = read.tree('seq.6.ph')
plot(tree,cex=1.3)

I had this:

> tree

Phylogenetic tree with 6 tips and 4 internal nodes.

Tip labels:
[1] "Stenotrophomonas_maltophilia"
[2] "Kingella_oralis"             
[3] "Pseudomonas_aeruginosa"      
[4] "Salmonella_typhi"            
[5] "Escherichia_coli"            
[6] "Haemophilus_parainfluenzae"  

Unrooted; includes branch lengths.
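
The "Unrooted" in this summary can be read off the Newick string itself. As far as I understand ape's convention, a tree counts as rooted when only two branches connect to the basal node (or when a root edge is present), and the Clustal tree has three. The sketch below counts the children of the outermost node in plain Python; the function name top_level_children is my own, and the second tree is a made-up example for contrast.

```python
def top_level_children(newick):
    """Count the children of the outermost node of a Newick string.

    Children of the outermost node are separated by commas that sit
    at parenthesis depth 1.
    """
    depth = 0
    commas = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            commas += 1
    return commas + 1


clustal_tree = (
    "((Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026):0.00827,"
    "Pseudomonas_aeruginosa:0.05950,"
    "((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,"
    "Haemophilus_parainfluenzae:0.06113):0.03863);"
)

print(top_level_children(clustal_tree))      # 3: a trifurcation at the base
print(top_level_children("((A,B),(C,D));"))  # 2: hypothetical rooted tree
```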

Note that ape considers the tree to be unrooted. We can confirm the numbering of the tips with the following code. We first draw the plot with the tip labels (species names) suppressed, then add our own labels. The first argument to the function tiplabels is the vector of labels, and the second is the vector of tip positions to be labeled. We see that the tips are numbered simply from bottom to top.

# draw the tree without the species names
plot(tree,cex=2,
  show.tip.label=FALSE)
# mark tips 1 to 6 with their tip numbers
tiplabels(as.character(1:6),1:6,cex=2,
  frame='c',bg='salmon',adj=c(0.5,0.5))

And we can find the internal nodes as follows:

nodelabels(cex=2)
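
The tip numbers shown by ape match the order in which the species appear in the Newick string. A small Python sketch, outside R, makes that order explicit. The regular expression is an assumption that fits this particular file (labels made of letters and underscores, each followed by a branch length); it is not a general Newick parser.

```python
import re

newick = (
    "((Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026):0.00827,"
    "Pseudomonas_aeruginosa:0.05950,"
    "((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,"
    "Haemophilus_parainfluenzae:0.06113):0.03863);"
)

# A tip label directly follows "(" or "," and is followed by ":".
# Internal branch lengths follow ")" instead, so they are not matched.
tips = re.findall(r"[(,]([A-Za-z_]+):", newick)

for number, name in enumerate(tips, start=1):
    print(number, name)
```

Kingella_oralis comes out as number 2, which is the number used for the outgroup below.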

We can root the tree by picking the outgroup (tip #2 = Kingella oralis) as follows:

t2 = root(tree,2)  # outgroup = tip 2 (Kingella_oralis)
plot(t2,cex=2,
  show.tip.label=FALSE)
tiplabels(as.character(1:6),1:6,cex=2,
  frame='c',bg='salmon',adj=c(0.5,0.5))

If I write the data for the new tree to disk

write.tree(t2,'x.phy')

I get:

((Pseudomonas_aeruginosa:0.0595,((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,Haemophilus_parainfluenzae:0.06113):0.03863):0.00827,Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026);

which is not quite the same as what we started with. Replacing the species names with two-letter abbreviations and removing the lengths of the terminal branches, we started with:

((SM,KO):0.00827,PA,((ST,EC):0.03356,HP):0.03863);

Now we have:

((PA,((ST,EC):0.03356,HP):0.03863):0.00827,SM,KO);

The distance (0.00827) that was previously assigned to the SM,KO clade (tips 1,2) is now assigned to the PA,ST,EC,HP clade (tips 3-6). It is the same internal branch in both cases; only the side from which we look at it has changed.
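
To convince myself that nothing else moved, here is a rough Python sketch that parses both Newick strings and describes every branch by the two groups of tips it separates. This is not how ape does it: parse and branch_lengths are names I made up, the parser handles only simple Newick like the strings in this post, and the comparison assumes a three-way split at the base (true for both trees).

```python
def parse(newick):
    """Parse simple Newick text into nested (label, length, children) tuples."""
    text = "".join(newick.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                          # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                      # skip ","
                children.append(node())
            pos += 1                          # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]               # empty for internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    return node()


def branch_lengths(tree):
    """Return {bipartition: length} for every branch that carries a length."""
    below = {}

    def tips_below(node):
        label, length, children = node
        if children:
            tips = frozenset().union(*map(tips_below, children))
        else:
            tips = frozenset([label])
        if length is not None:
            below[tips] = length
        return tips

    all_tips = tips_below(tree)
    # Name each branch by both of its sides, so the root position drops out.
    return {frozenset([side, all_tips - side]): length
            for side, length in below.items()}


original = ("((Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026):0.00827,"
            "Pseudomonas_aeruginosa:0.05950,"
            "((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,"
            "Haemophilus_parainfluenzae:0.06113):0.03863);")
rerooted = ("((Pseudomonas_aeruginosa:0.0595,"
            "((Salmonella_typhi:0.01297,Escherichia_coli:0.01491):0.03356,"
            "Haemophilus_parainfluenzae:0.06113):0.03863):0.00827,"
            "Stenotrophomonas_maltophilia:0.07574,Kingella_oralis:0.08026);")

before = branch_lengths(parse(original))
after = branch_lengths(parse(rerooted))
print(before == after)  # True: same branches, same lengths

sm_ko = frozenset(["Stenotrophomonas_maltophilia", "Kingella_oralis"])
key = next(k for k in before if sm_ko in k)
print(before[key], after[key])  # 0.00827 0.00827
```

Both strings describe the same nine branches with the same lengths, so the two files differ only in which node the tree is drawn from.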