Thursday, December 24, 2009

Processing BLAST output

I was hoping to figure out how to use PyCogent's parsers to handle the output from BLAST, but I haven't succeeded at that yet. So, before I leave this topic (at least for a while), I thought I would show again the solution that I've used for my own projects. I talked about this before here. For starters, I think we want to use XML, since NCBI may change the other output formats at will.

You can view a BLAST search as having up to 3 levels of elements:

• multiple sequences that are in the same query file
• multiple hits, sequences in the database that match
• multiple HSPs (high-scoring segment pairs), within each hit

Here is a graphic from the old post which shows this reflected in the structure of an XML file.



An HSP has a query sequence (qseq), a hit sequence (hseq), and a midline, in which a '|' character marks a match and a space marks no match. For what follows, I'll consider just the simpler case where only a single query has been input to BLAST.
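The same three-level nesting can be seen by walking a small fragment in code. This is only a sketch: the fragment is hand-written and heavily trimmed, the sequences are invented, and the names SAMPLE and count_levels are made up for illustration. Only the tag names are the ones BLAST's XML output uses.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment (not real BLAST output):
# one query (Iteration), two hits, and two HSPs in the first hit.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>lcl|s2</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>TGCAGC</Hsp_qseq>
              <Hsp_midline>||||||</Hsp_midline>
              <Hsp_hseq>TGCAGC</Hsp_hseq>
            </Hsp>
            <Hsp>
              <Hsp_qseq>GAGC</Hsp_qseq>
              <Hsp_midline>|| |</Hsp_midline>
              <Hsp_hseq>GACC</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>lcl|s3</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>TGCA</Hsp_qseq>
              <Hsp_midline>||||</Hsp_midline>
              <Hsp_hseq>TGCA</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def count_levels(root):
    """Return one list per query; each holds (Hit_id, number of HSPs) pairs."""
    summary = []
    for iteration in root.findall('BlastOutput_iterations/Iteration'):  # level 1: queries
        hits = []
        for hit in iteration.findall('Iteration_hits/Hit'):             # level 2: hits
            hsps = hit.findall('Hit_hsps/Hsp')                          # level 3: HSPs
            hits.append((hit.findtext('Hit_id'), len(hsps)))
        summary.append(hits)
    return summary

root = ET.fromstring(SAMPLE)
summary = count_levels(root)
```

Each level is reached by a relative path from the one above it, which is the pattern the parsing code later in this post relies on.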

Using the setup from last time (here and here), I do this from the command line:

~/Software/blast/programs/blast-2.2.22/bin/blastall \
-i ~/Desktop/temp/inseqs.fasta -p blastn -m 7 \
-d ~/Desktop/temp/refseqs.fasta \
-o ~/Desktop/blast_test.xml
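If the search is to be launched from a script rather than typed at the prompt, the same command can be assembled as an argument list. This is a rough sketch: build_blastall_command is a hypothetical helper, and the paths are simply the ones from the command above. The comments give my reading of each flag for the legacy blastall program.

```python
import os

def build_blastall_command(blastall, query, database, output, program='blastn'):
    """Assemble the blastall argument list (hypothetical helper, for illustration)."""
    expand = os.path.expanduser    # the shell expands '~'; subprocess would not
    return [
        expand(blastall),
        '-i', expand(query),       # input FASTA file with the query sequence(s)
        '-p', program,             # which BLAST program to run
        '-m', '7',                 # output format 7 is XML
        '-d', expand(database),    # database to search
        '-o', expand(output),      # file that receives the results
    ]

cmd = build_blastall_command(
    '~/Software/blast/programs/blast-2.2.22/bin/blastall',
    '~/Desktop/temp/inseqs.fasta',
    '~/Desktop/temp/refseqs.fasta',
    '~/Desktop/blast_test.xml')

# To actually run it (assumes blastall is installed at that path):
# import subprocess
# subprocess.check_call(cmd)
```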

Now we want to parse the results in blast_test.xml. The first section is the parsing code, and the second does the printing.

# print() then behaves the same under Python 2 and Python 3
from __future__ import print_function
import xml.etree.ElementTree as ET

def parseBLASTIteration(iteration, howmany=3):
    # collect up to `howmany` hits, each with a list of its HSPs
    hitL = list()
    for hit in iteration.findall('Iteration_hits/Hit')[:howmany]:
        hitD = dict()
        for k in ['Hit_id','Hit_def','Hit_accession']:
            hitD[k] = hit.findtext(k)
        hitD['hsps'] = list()
        # a single Hit_hsps element holds all the Hsp children,
        # so loop over the Hsp elements themselves
        for hsp in hit.findall('Hit_hsps/Hsp'):
            hspD = dict()
            hspD['score'] = hsp.findtext('Hsp_score')
            hspD['evalue'] = hsp.findtext('Hsp_evalue')
            hspD['identity'] = hsp.findtext('Hsp_identity')
            hspD['gaps'] = hsp.findtext('Hsp_gaps')
            hspD['length'] = hsp.findtext('Hsp_align-len')
            hspD['query'] = hsp.findtext('Hsp_qseq')
            hspD['midline'] = hsp.findtext('Hsp_midline')
            hspD['hitseq'] = hsp.findtext('Hsp_hseq')
            identity = int(hspD['identity'])
            length = int(hspD['length'])
            try:
                hspD['%identity'] = identity*100.0/length
            except ZeroDivisionError:
                # keep a float so the '%3.2f' formatting below still works
                hspD['%identity'] = float('nan')
            hitD['hsps'].append(hspD)
        hitL.append(hitD)
    return hitL

def parseSingleIteration(tree,howmany=3):
    # single-query case: only the first Iteration is examined
    iteration = tree.find('BlastOutput_iterations/Iteration')
    hitL = parseBLASTIteration(iteration,howmany)
    return hitL
#---------------------------------------------------------

def showHitList(hitL,withaccession=True):
    for j,hitD in enumerate(hitL):
        print('hit #', j+1)
        for k in ['Hit_id','Hit_def','Hit_accession']:
            print(k, hitD[k])
        if withaccession:
            print(hitD['Hit_accession'].ljust(10), end=' ')
        hspL = hitD['hsps']
        # show only the first HSP of each hit
        for hspD in hspL[:1]:
            print(hspD['identity'] + '/' + hspD['length'], end=' ')
            print(('%3.2f' % hspD['%identity']).rjust(7))
            printHspAlignment(hspD)
        print()

def printHspAlignment(hspD):
    # print the alignment in blocks of line_length columns
    line_length = 60
    query = hspD['query']
    midline = hspD['midline']
    hitseq = hspD['hitseq']
    for i in range(0,len(query),line_length):
        print(query[i:i+line_length])
        print(midline[i:i+line_length])
        print(hitseq[i:i+line_length])

fn = 'blast_test.xml'
tree = ET.parse(fn)
hitL = parseSingleIteration(tree)
showHitList(hitL[:3])

Here is the output:

$ python parseBlastXML.py 
hit # 1
Hit_id lcl|s2
Hit_def No definition line found
Hit_accession s2
s2 26/29 89.66
TGCAGCTTGAG-CACAGGTTAGAGCCTTC
||||||||||| |||||| |||||||||
TGCAGCTTGAGCCACAGGAGAGAGCCTTC

hit # 2
Hit_id lcl|s3
Hit_def No definition line found
Hit_accession s3
s3 12/12 100.00
TGCAGCTTGAGC
||||||||||||
TGCAGCTTGAGC

hit # 3
Hit_id lcl|s1
Hit_def No definition line found
Hit_accession s1
s1 12/12 100.00
TGCAGCTTGAGC
||||||||||||
TGCAGCTTGAGC
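
The script above looks only at the first Iteration element, which is all the single-query case needs. For a query file with several sequences, one possible extension is sketched below. It assumes that each query shows up as its own Iteration element carrying an Iteration_query-def child, which is worth checking against your own output file. The fragment is invented and trimmed, and best_hit_per_query is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented two-query fragment, trimmed to the tags this sketch reads.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>q1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>lcl|s2</Hit_id></Hit>
        <Hit><Hit_id>lcl|s3</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>q2</Iteration_query-def>
      <Iteration_hits>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def best_hit_per_query(root):
    """Map each query's definition line to the Hit_id of its first hit (or None)."""
    best = {}
    for iteration in root.findall('BlastOutput_iterations/Iteration'):
        query = iteration.findtext('Iteration_query-def')
        first_hit = iteration.find('Iteration_hits/Hit')   # None if the query had no hits
        best[query] = None if first_hit is None else first_hit.findtext('Hit_id')
    return best

best = best_hit_per_query(ET.fromstring(SAMPLE))
```

The same loop could pass each iteration to parseBLASTIteration instead, to get the full hit and HSP dictionaries for every query.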