Friday, December 4, 2009

PyCogent (2)

I've started playing with PyCogent. The first simple tutorial is here. We start by importing stuff:

from cogent import LoadSeqs, DNA
fn = 'SThemA.txt'
s = LoadSeqs(fn)

This fails, even though the file is in FASTA format.

cogent.parse.record.FileFormatError: Unsupported file format txt

We must change the extension.

fn = 'SThemA.fasta'
s = LoadSeqs(fn)
print s.MolType

Now, this kinda sorta works, but there is a slight problem:

MolType(('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'))

We must tell it what type of characters we're dealing with.

fn = 'SThemA.fasta'
s = LoadSeqs(fn,moltype=DNA,aligned=False)
print s.MolType
print s.SeqLen

That's much better:

MolType(('T', 'C', 'A', 'G'))
1257

Let's print the sequence. The commented versions don't work:

# print s[:24]
# print s.toDna()[:24]
print s.toFasta()[:24]


>ST hemA
ATGACCCTTTTAGCG

The error with the first line is:

TypeError: 'SequenceCollection' object is unsubscriptable

Finally, let's translate:

print s.getTranslation()


>ST hemA
MTLLALGINHKTAPVSLRERVTFSPDTLDQALDSLLAQPMVQGGVVLSTCNRTELYLSVEEQDNLQEALIRWLCDYHNLNEDDLRNSLYWHQDNDAVSHLMRVASGLDSLVLGEPQILGQVKKAFADSQKGHLNASALERMFQKSFSVAKRVRTETDIGASAVSVAFAACTLARQIFESLSTVTVLLVGAGETIELVARHLREHKVQKMIIANRTRERAQALADEVGAEVISLSDIDARLQDADIIISSTASPLPIIGKGMVERALKSRRNQPMLLVDIAVPRDVEPEVGKLANAYLYSVDDLQSIISHNLAQRQAAAVEAETIVEQEASEFMAWLRAQGASETIREYRSQSEQIRDELTTKALSALQQGGDAQAILQDLAWKLTNRLIHAPTKSLQQAARDGDDERLNILRDSLGLE

But, it only works if I trim off the stop codon in the file. I have no idea how to do this from code, since the objects don't support slicing.

print s.__dict__


{'Info': {'Refs': {}}, '_seqs': [DnaSequence(ATGACCC... 1254)], 'Name': None, 'NamedSeqs': {'ST hemA': DnaSequence(ATGACCC... 1254)}, 'Alphabet': ('T', 'C', 'A', 'G', '-', 'B', 'D', 'H', 'K', 'M', 'N', 'S', 'R', 'W', 'V', 'Y', '?'), 'SeqLen': 1254, 'Names': ['ST hemA'], 'SeqData': [DnaSequence(ATGACCC... 1254)], 'MolType': MolType(('T', 'C', 'A', 'G'))}


Any ideas?

What we can do is manipulate the data before making the object. Now, we can even use a text file!


FH = open('SThemA.txt')
data = FH.read().strip()
data = data.split('\n',1)[1][:-3]
seqs = { 'SThemA':data }
s = LoadSeqs(data=seqs,
moltype=DNA,aligned=False)
print s.getTranslation().toFasta()[:24]


>ST hemA
MTLLALGINHKTAPV


We had to split off the title line first, then lose the last three bases. That is good enough for now.

1 comment:

Gavin said...

A few comments regarding how you're doing things.

First, to get help on something in python do help(object). Doing this on the LoadSeqs function shows a format argument. You could have used this, setting it to format='fasta' and then you don't need to change the filename.

Second, dir(obj) shows the object attributes, including method names. In PyCogent we use theMixedCamelCase to indicate methods. If you do dir(s) on your example you would see (among many other methods) withoutTerminalStopCodons. So you can get the translation as: s=s.withoutTerminalStopCodons(); s.getTranslation().

Third, some of the errors you are seeing on trying to slice s are because it's a sequence collection. This is a collection of unaligned sequences or, at least, you haven't said they're aligned. In which case slicing is not defined.