Thursday, December 10, 2009

PyCogent 12: simple Phylogenetic Tree

I'm exploring PyCogent (docs, download page)---first post here.

I want to use the sequences we downloaded last time to make a simple phylogenetic tree. Before we do that, however, let's modify the title lines since they are way too long:

>gi|15384333|gb|AF411019.1| Achromobacter xylosoxidans strain AU0665 16S ribosomal RNA gene, partial sequence

Since I don't know how to do that with an Alignment (or even a Sequence Collection) object yet, we'll modify the original file of downloaded sequences and repeat the alignment.

fn = 'temp.fasta'

def short(s):
stuff,genus,species = s.split()[:3]
gb = stuff.split('|')[3].split('.')[0]
return '_'.join((gb,genus[0],species))

def modifyNames(fn):
FH = open(fn,'r')
data =
L = data.split('\n\n')
pL = list()
for e in L:
title,seq = e.strip().split('\n',1)
title = short(title)
pL.append('>' + title + '\n' + seq)
FH = open(fn,'w')


The next step is to do the alignment. (I showed this code in the other post). Given the alignment, we make the tree like this:

from cogent import LoadSeqs, DNA
from cogent.evolve.models import GTR,HKY85
from cogent.phylo import distance, nj

def prelim():
fn = 'temp.aln.fasta'
seqs = LoadSeqs(fn,format='fasta')
model = HKY85()
dcalc = distance.EstimateDistances(seqs,model)
d = dcalc.getPairwiseDistances()
tree = nj.nj(d)
print '\n' + tree.asciiArt() + '\n'
print d.items()[0][0]
print d.items()[0][1]


The output is (the progress indicator is turned off):

| \-AJ278451_A_xylosoxidans
| | /-AJ002809_A_sp.
| \edge.0--|
-root----| \-EU373389_A_xylosoxidans

('AF411019_A_xylosoxidans', 'DQ450530_A_bacterium')

The last two lines give the first of the pairwise distances. This is a dictionary keyed by a tuple of the the two sequence titles. We'll explore using PyCogent to draw the tree next time. Having written it to disk, we can use our old friend R:

t = read.tree('temp.tree')

The result is shown at the top of the post. There is a slight problem that I haven't bothered to fix. The sequence titles saved in the Tree are quoted, and R prints them as is.