Wednesday, February 10, 2010

Unifrac (5): results using PyCogent



Last time, I used a set of simulated sequences and their phylogenetic tree as input to the web interface of Unifrac (the original version). This time, I propose to repeat the analysis on my laptop using PyCogent. (See the PyCogent category in the sidebar for more from me on this, or start with this post). I'm just going to do the analysis here. Next time I want to look the distinction between PCA and PCoA.

Provided that we've done our imports, and loaded both the tree and the environment, the actual analysis runs in a single line. The results are shown first, followed by the code. The commented lines in the code were used in the first implementation based on the usage example, but aren't needed here.

There are a couple of things to notice. As for the rest of PyCogent, objects bearing results act like dicts, and we recover the elements of interest with result['pcoa'] (for example). Second, the distance matrix (a table of UniFrac metrics for pairwise comparisons between each of the three samples) is slightly different than what we got from the web interface. The comparison of A with C gives exactly the same result, but the other two have changed. Therefore, it is not surprising that the PCoA results are different. It is however, a bit of a shock to see that the small change in the ED (Environmental Distance) matrix has rotated the plot through about 170°.

[UPDATE: I forgot to mention that I had altered PyCogent slightly to always do PCoA. I think what you will need to do with this code is to pass an argument to fast_unifrac of modes = ['distance_matrix','pcoa'], but I haven't tested it yet.]


(array([[ 0.        ,  0.69276243,  0.6549957 ],
[ 0.69276243, 0. , 0.60625465],
[ 0.6549957 , 0.60625465, 0. ]]), ['A', 'B', 'C'])
================================================================
Type Label vec_num-0 vec_num-1 vec_num-2
----------------------------------------------------------------
Eigenvectors A 0.39 0.08 -0.00
Eigenvectors B -0.28 0.25 -0.00
Eigenvectors C -0.12 -0.33 -0.00
Eigenvalues eigenvalues 0.25 0.18 0.00
Eigenvalues var explained (%) 57.64 42.36 0.00
----------------------------------------------------------------




from cogent import LoadTree
#from cogent.parse.tree import DndParser
#from cogent.maths.unifrac.fast_tree import UniFracTreeNode
from cogent.maths.unifrac.fast_unifrac import fast_unifrac

# convert our environment file repr back into a dict
FH = open('environ.txt','r')
data = FH.read().strip()
FH.close()
L = data.split('\n')
env_dict = dict()
for e in L:
seq_name, sample = e.strip().split()
env_dict[seq_name] = {sample:1}

tr = LoadTree('samples.tree')
#tree_str = tr.getNewick(with_distances=True)
#tr = DndParser(tree_str, UniFracTreeNode)
result = fast_unifrac(tr, env_dict)
print result['distance_matrix']
print result['pcoa']

'''
R code:
x = c(0.39,-0.28,-0.12)
y = c(0.08,0.25,-0.33)
plot(x,y,pch=c('A','B','C'),cex=2)
'''