Thursday, December 10, 2009

PyCogent 12: simple Phylogenetic Tree



I'm exploring PyCogent (docs, download page)---first post here.

I want to use the sequences we downloaded last time to make a simple phylogenetic tree. Before we do that, however, let's modify the title lines since they are way too long:


>gi|15384333|gb|AF411019.1| Achromobacter xylosoxidans strain AU0665 16S ribosomal RNA gene, partial sequence


Since I don't know how to do that with an Alignment (or even a Sequence Collection) object yet, we'll modify the original file of downloaded sequences and repeat the alignment.


fn = 'temp.fasta'

def short(s):
stuff,genus,species = s.split()[:3]
gb = stuff.split('|')[3].split('.')[0]
return '_'.join((gb,genus[0],species))

def modifyNames(fn):
FH = open(fn,'r')
data = FH.read().strip()
FH.close()
L = data.split('\n\n')
pL = list()
for e in L:
title,seq = e.strip().split('\n',1)
title = short(title)
pL.append('>' + title + '\n' + seq)
FH = open(fn,'w')
FH.write('\n\n'.join(pL))
FH.close()

modifyNames(fn)


The next step is to do the alignment. (I showed this code in the other post). Given the alignment, we make the tree like this:


from cogent import LoadSeqs, DNA
from cogent.evolve.models import GTR,HKY85
from cogent.phylo import distance, nj

def prelim():
fn = 'temp.aln.fasta'
seqs = LoadSeqs(fn,format='fasta')
model = HKY85()
dcalc = distance.EstimateDistances(seqs,model)
dcalc.run(show_progress=False)
d = dcalc.getPairwiseDistances()
tree = nj.nj(d)
print '\n' + tree.asciiArt() + '\n'
tree.writeToFile('temp.tree')
print d.items()[0][0]
print d.items()[0][1]

prelim()


The output is (the progress indicator is turned off):


                              /-DQ450530_A_bacterium
/edge.1--|
| \-AJ278451_A_xylosoxidans
/edge.2--|
| | /-AJ002809_A_sp.
| \edge.0--|
-root----| \-EU373389_A_xylosoxidans
|
|--AF411020_A_xylosoxidans
|
\-AF411019_A_xylosoxidans

('AF411019_A_xylosoxidans', 'DQ450530_A_bacterium')
0.00403969203526


The last two lines give the first of the pairwise distances. This is a dictionary keyed by a tuple of the the two sequence titles. We'll explore using PyCogent to draw the tree next time. Having written it to disk, we can use our old friend R:


setwd('Desktop')
library(ape)
t = read.tree('temp.tree')
plot(t)
plot(t,cex=1.5)
axisPhylo()
plot(t,cex=1.3)
axisPhylo()


The result is shown at the top of the post. There is a slight problem that I haven't bothered to fix. The sequence titles saved in the Tree are quoted, and R prints them as is.