Python for Bioinformatics: Pycogent 16: BLAST parsers

I'm exploring PyCogent (docs, download page)---first post here.

Continuing with the topic from the other day, processing of BLAST output, I was confused about usage of the blast parsers. As defined in cogent.parse.blast:

def MinimalBlastParser9(lines, 
    include_column_names=False):

and in cogent.parse.blast_xml:

def MinimalBlastParser7(lines, 
    include_column_names=False,
    format='xml'):

The first argument is named 'lines', but despite that (as suggested in comments), these functions are looking for a file object. That is, what we're going to do (after we do the BLAST, below) is this:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

L will then contain one item for each query sequence.

That item is a tuple with two elements
• a dictionary with meta-data
• a list of sub-lists

Each sub-list contains the data for an Hsp.

I'll do the test with a different db containing Achromobacter 16S rRNA seqs.

>AF411020.1
>EF672258.1
>EU373389.1
>DQ450530.1
>AJ278451.1
>AF411019.1
>AJ002809.1

You could grab these as described here. First, we have to format the db for BLAST.

from cogent.app.formatdb import build_blast_db_from_fasta_path
result = build_blast_db_from_fasta_path('temp/achromo.fasta')

blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 9 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.txt

This data is actually simple enough that you don't really need a parser. I will do this analysis from the interpreter, but I'm not showing the interpreter prompt '>>> ').

from cogent.parse.blast import MinimalBlastParser9
FH = open('blast_test.txt', 'r')
data = FH.readlines()
FH.close()
print data[3].split()[:6]
['#', 'Fields:', 'Query', 'id,', 'Subject', 'id,']
for line in data[4:10]:
    print line.split()[:4]
['AF411020.1', 'AF411020.1', '100.00', '630']
['AF411020.1', 'AF411019.1', '99.84', '630']
['AF411020.1', 'DQ450530.1', '99.68', '630']
['AF411020.1', 'EU373389.1', '99.37', '630']
['AF411020.1', 'AJ278451.1', '99.21', '630']
['AF411020.1', 'EF672258.1', '99.05', '630']

We parse as follows:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

for t in L:
    metaD, subL = t
    fields = metaD['FIELDS'].split(',')
    fields = [f.strip() for f in fields]
    for e in subL:
        for k,v in zip(fields,e):
     print k.ljust(25), v
        print

Data for the first hit looks like this:

Query id                  AF411020.1
Subject id                AF411020.1
% identity                100.00
alignment length          630
mismatches                0
gap openings              0
q. start                  1
q. end                    630
s. start                  1
s. end                    630
e-value                   0.0
bit score                 1249

Note that this format does not contain the aligned sequences.

Now, let's do it with XML-formatted ouptut:

blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 7 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.xml

from cogent.parse.blast_xml import MinimalBlastParser7
blastfile = open('blast_test.xml', 'r')
blast_results = MinimalBlastParser7(blastfile)
L = list(blast_results)

Again, L contains one item for each query sequence. That item is a tuple with two elements

• a dictionary with meta-data
• a list of sub-lists

Each sub-list is the data for an Hsp, whose first element is a list of keys. This is a slightly different arrangement than for the first example.

blast_stuff, subL = L[0]
len(subL)
kL = subL[0]
hit1 = subL[1]
for k,v in zip(kL,hit1):
    if 'ALIGN' in k: continue
    print k.ljust(20), v

QUERY ID             1
SUBJECT_ID           lcl|AF411020.1
HIT_DEF              No definition line found
HIT_ACCESSION        AF411020.1
HIT_LENGTH           1490
PERCENT_IDENTITY     630
MISMATCHES           0
GAP_OPENINGS         0
QUERY_START          1
QUERY_END            630
SUBJECT_START        1
SUBJECT_END          630
E_VALUE              0
BIT_SCORE            1249.38
SCORE                630
POSITIVE             630

for k in ('QUERY_ALIGN','MIDLINE_ALIGN','SUBJECT_ALIGN'):
    print hit1[kL.index(k)][:40]

AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
||||||||||||||||||||||||||||||||||||||||
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT

And while the BLAST parsers have an input argument labeled data but want a file object, the Fasta parser has input arg labeled infile and wants a string! (and note: you must do readlines() rather than read()).

from cogent.parse.fasta import MinimalFastaParser
path = 'temp/achromo.fasta'
FH = open(path)
data = FH.readlines()
FH.close()
p = MinimalFastaParser(data)
for e in p:
    print e[0]
    print e[1][:40]

AF411020.1
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
EF672258.1
TTAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGAT
EU373389.1
TCGGAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGG
DQ450530.1
ATTAGAGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGA
AJ278451.1
AGAGTTTGATCATGGCTCAGATTGAACGCTAGCGGGATGC
AF411019.1
AGTTTGATCCTGGCTCAGATTGAACGCTAGCGGGATGCCT
AJ002809.1
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAAC

Saturday, December 26, 2009

Pycogent 16: BLAST parsers