Wednesday, December 9, 2009

PyCogent 11: start w/multiple sequences

I'm exploring PyCogent (docs, download page)---first post here.

Next up are multiple sequences. In the first part we grab some rRNA sequences from NCBI.

import sys
from cogent.db.ncbi import EFetch

L = ['AF411020.1','EU373389.1',
'DQ450530.1','AJ278451.1',
'AF411019.1','AJ002809.1']
fn = 'temp.fasta'

def test1():
e = EFetch(db='nucleotide',
rettype='fasta',
id=','.join(L))
s = e.read()
#print s
FH = open(fn,'w')
FH.write(s)
FH.close()

test1()

In the second part, we use muscle to align them (like before):

from cogent import LoadSeqs, DNA
from cogent.app.muscle import align_unaligned_seqs as malign

def test2():
seqs = LoadSeqs(filename=fn,
aligned=False,
format='fasta')
print type(seqs)
aln = malign(seqs,DNA)
aln.writeToFile('temp.aln.fasta')

test2()


<class 'cogent.core.alignment.SequenceCollection'>

In the third part we just load the aligned sequences back into Python.

def test3():
fn = 'temp.aln.fasta'
sc = LoadSeqs(fn,format='fasta')
print sc[:40]
print type(sc)

test3()


>gi|21436540|emb|AJ278451.1| Achromobacter xylosoxidans subsp. denitrificans partial 16S rRNA gene, strain DSM 30026 (T)
----AGAGTTTGATCATGGCTCAGATTGAACGCTAGCGGG
>gi|15384333|gb|AF411019.1| Achromobacter xylosoxidans strain AU0665 16S ribosomal RNA gene, partial sequence
------AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG
>gi|2832590|emb|AJ002809.1| Alcaligenes sp. 16S rRNA gene, isolate R6
------------------------ATTGAACGCTAGCGGG
>gi|92919431|gb|DQ450530.1| Alcaligenaceae bacterium LBM 16S ribosomal RNA gene, partial sequence
-ATTAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG
>gi|171191189|gb|EU373389.1| Achromobacter xylosoxidans strain TPL14 16S ribosomal RNA gene, partial sequence
TCGGAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG
>gi|15384334|gb|AF411020.1| Achromobacter xylosoxidans strain AU1011 16S ribosomal RNA gene, partial sequence
------AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG

<class 'cogent.core.alignment.Alignment'>

Notice that when we load unaligned sequences, we have a cogent.core.alignment.SequenceCollection, while an aligned set is a cogent.core.alignment.Alignment. We'll use these going forward. And yes, those are mighty long title lines. And we can do slices on the alignment.