Saturday, December 26, 2009

Pycogent 16: BLAST parsers

I'm exploring PyCogent (docs, download page)---first post here.

Continuing with the topic from the other day, processing of BLAST output, I was confused about usage of the blast parsers. As defined in cogent.parse.blast:

def MinimalBlastParser9(lines, 
include_column_names=False):

and in cogent.parse.blast_xml:

def MinimalBlastParser7(lines, 
include_column_names=False,
format='xml'):

The first argument is named 'lines', but despite that (as suggested in comments), these functions are looking for a file object. That is, what we're going to do (after we do the BLAST, below) is this:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

L will then contain one item for each query sequence.

That item is a tuple with two elements
• a dictionary with meta-data
• a list of sub-lists

Each sub-list contains the data for an Hsp.

I'll do the test with a different db containing Achromobacter 16S rRNA seqs.

>AF411020.1
>EF672258.1
>EU373389.1
>DQ450530.1
>AJ278451.1
>AF411019.1
>AJ002809.1

You could grab these as described here. First, we have to format the db for BLAST.

from cogent.app.formatdb import build_blast_db_from_fasta_path
result = build_blast_db_from_fasta_path('temp/achromo.fasta')


blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 9 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.txt


This data is actually simple enough that you don't really need a parser. I will do this analysis from the interpreter, but I'm not showing the interpreter prompt '>>> ').

from cogent.parse.blast import MinimalBlastParser9
FH = open('blast_test.txt', 'r')
data = FH.readlines()
FH.close()
print data[3].split()[:6]
['#', 'Fields:', 'Query', 'id,', 'Subject', 'id,']
for line in data[4:10]:
print line.split()[:4]
['AF411020.1', 'AF411020.1', '100.00', '630']
['AF411020.1', 'AF411019.1', '99.84', '630']
['AF411020.1', 'DQ450530.1', '99.68', '630']
['AF411020.1', 'EU373389.1', '99.37', '630']
['AF411020.1', 'AJ278451.1', '99.21', '630']
['AF411020.1', 'EF672258.1', '99.05', '630']

We parse as follows:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

for t in L:
metaD, subL = t
fields = metaD['FIELDS'].split(',')
fields = [f.strip() for f in fields]
for e in subL:
for k,v in zip(fields,e):
print k.ljust(25), v
print

Data for the first hit looks like this:

Query id                  AF411020.1
Subject id AF411020.1
% identity 100.00
alignment length 630
mismatches 0
gap openings 0
q. start 1
q. end 630
s. start 1
s. end 630
e-value 0.0
bit score 1249

Note that this format does not contain the aligned sequences.

Now, let's do it with XML-formatted ouptut:

blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 7 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.xml


from cogent.parse.blast_xml import MinimalBlastParser7
blastfile = open('blast_test.xml', 'r')
blast_results = MinimalBlastParser7(blastfile)
L = list(blast_results)

Again, L contains one item for each query sequence. That item is a tuple with two elements

• a dictionary with meta-data
• a list of sub-lists

Each sub-list is the data for an Hsp, whose first element is a list of keys. This is a slightly different arrangement than for the first example.

blast_stuff, subL = L[0]
len(subL)
kL = subL[0]
hit1 = subL[1]
for k,v in zip(kL,hit1):
if 'ALIGN' in k: continue
print k.ljust(20), v


QUERY ID             1
SUBJECT_ID lcl|AF411020.1
HIT_DEF No definition line found
HIT_ACCESSION AF411020.1
HIT_LENGTH 1490
PERCENT_IDENTITY 630
MISMATCHES 0
GAP_OPENINGS 0
QUERY_START 1
QUERY_END 630
SUBJECT_START 1
SUBJECT_END 630
E_VALUE 0
BIT_SCORE 1249.38
SCORE 630
POSITIVE 630


for k in ('QUERY_ALIGN','MIDLINE_ALIGN','SUBJECT_ALIGN'):
print hit1[kL.index(k)][:40]


AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
||||||||||||||||||||||||||||||||||||||||
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT

And while the BLAST parsers have an input argument labeled data but want a file object, the Fasta parser has input arg labeled infile and wants a string! (and note: you must do readlines() rather than read()).

from cogent.parse.fasta import MinimalFastaParser
path = 'temp/achromo.fasta'
FH = open(path)
data = FH.readlines()
FH.close()
p = MinimalFastaParser(data)
for e in p:
print e[0]
print e[1][:40]


AF411020.1
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
EF672258.1
TTAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGAT
EU373389.1
TCGGAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG
DQ450530.1
ATTAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGA
AJ278451.1
AGAGTTTGATCATGGCTCAGATTGAACGCTAGCGGGATGC
AF411019.1
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
AJ002809.1
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAAC