Python for Bioinformatics: Unifrac (4): Web results

Wednesday, February 10, 2010

Unifrac (4): Web results

I'm exploring UniFrac in a series of posts (first here). Last time, we produced a set of simulated sequences from three samples, or environments. I ended up fiddling with that a bit more. I changed this line in my code to drop two of the A sequences out (A11 and A12 in the original example). I did this because the initial set did not show any significant difference between samples.

A = {0:5,1:5,2:0,3:0,4:0,5:0,6:1,7:1}

(Because the sequences are numbered in order, in the new set the labels A11 and A12 have been reassigned to the previous A13 and A14). The sequences are organized to make a phylogenetic tree which was written to disk as 'samples.tree'. I also wrote a simple script to process the titles of the sequences and save them in a file named 'environ.txt'. The first line is:

A1 A

Now let's use these two files as input to UniFrac (web interface). We compute the UniFrac metric or distance for each pair of samples. As you can see, we're going to do 1000 randomizations (this requires simple registration with the site):

and evaluate the signficance of the results by repeated randomization of the labels. The raw p-values

need to be corrected by multiplying by the total number of comparisons (Bonferroni correction).

Notice that only A and B are significantly different, although it was a near-run thing with A and C. We download the actual UniFrac values for comparison with what PyCogent gives us. The text version of the file looks like this:

. A B C
A 0 0.666882546144 0.654995703131
B 0.666882546144 0 0.576182797703
C 0.654995703131 0.576182797703 0

We also ask for a PCA (Principal Coordinates Analysis). Here is the plot of the two principal coordinates:

And here are the eigenvalues and eigenvectors in a data file:

pc vector number  1 2 3 
A -0.396005816461 -0.0214251320143 -3.04168679166e-09 
B 0.22015648746 -0.276525789753 -3.04168679166e-09 
C 0.175849329001 0.297950921767 -3.04168679166e-09 


eigvals 0.236212472152 0.165700300462 2.77555756156e-17 
percent variation explained 58.7720740039 41.2279259961 6.90587050397e-15

Of course, we can always replot using R:

> x = c(-0.396,0.220,0.176)
> y = c(-0.021,-0.277,0.298)
> plot(x,y,pch=c('A','B','C'),cex=2)

You can make this as fancy as you like.