Wednesday, December 9, 2009

PyCogent 9: getting serious with Sequence

I'm exploring PyCogent (docs, download page)---first post here.

The most central and basic object type in PyCogent is Sequence. There are several ways to construct a DNA Sequence object. For the demo, I need to load a FASTA-formatted sequence, and recover the title and DNA sequence separately. (Substitute your own to follow along).

from cogent import LoadSeqs, DNA, Sequence

def loadData(fn):
FH = open(fn, 'r')
data = FH.read()
FH.close()
return data

fn = 'cogent/SThemA.fasta'
data = loadData(fn)
n,s = data.strip().split('\n',1)
n = n[1:]

Now, here are five ways to create a DNA Sequence object:

# method 1:  from a string
seq = Sequence(moltype=DNA,
name=n,
seq=s)
print '1', seq[:24]

# method 2: from a text file
tfn = 'cogent/SThemA.txt'
seq = Sequence(moltype=DNA,
filename=tfn,
format='fasta')
print '2', seq[:24]

# method 3: from a file
# guess format from extension
seq = Sequence(moltype=DNA,
filename=fn)
print '3', seq[:24]

# method 4: a method in DNA
seq = DNA.makeSequence(s)
seq.Name = n
print '4', seq[:24]

# method 5: by joining two DNA sequences
a = DNA.makeSequence("AGTACACTGGT")
seq2 = seq[:6] + a
print '5', seq2

# method 6: from an NCBI record (via str)
from cogent.db.ncbi import EFetch
e = EFetch(db='nucleotide',
rettype='fasta',
id='154102')
data = e.read()
n,s = data.strip().split('\n',1)
# go to method 1
seq = Sequence(moltype=DNA,
name=n,
seq=s)
print '6', seq[:24]

output:

1 ATGACCCTTTTAGCGCTCGGTATT
2 ATGACCCTTTTAGCGCTCGGTATT
3 ATGACCCTTTTAGCGCTCGGTATT
4 ATGACCCTTTTAGCGCTCGGTATT
5 ATGACCAGTACACTGGT
6 ATGACCCTTTTAGCGCTCGGTATT

Here is one more that is a little trickier:

# method 7:  using a parser
from cogent.parse.fasta import FastaParser

# must modify for lowercase
f=open(fn)
g = FastaParser(f)
n,s = g.next()
s = str(s).upper()
# go to method 1
seq = Sequence(moltype=DNA,
name=n,
seq=s)
print '7', seq[:24]

# simpler use is uppercase sequence
fn2 = '.temp'
FH = open(fn2,'w')
FH.write('>' + n + '\n' + s.upper())
FH.close()

f=open(fn2)
g = FastaParser(f)
n,s = g.next()
# go to method 1
seq = Sequence(moltype=DNA,
name=n,
seq=s)
print '8', seq[:24]


7 ATGACCCTTTTAGCGCTCGGTATT
8 ATGACCCTTTTAGCGCTCGGTATT

We'll look at the most basic methods of DNA Sequence objects next.