Python for Bioinformatics: Processing BLAST output

Thursday, December 24, 2009

Processing BLAST output

I was hoping to figure out how to use PyCogent's parsers to handle the output from BLAST, but I haven't succeeded at that yet. So, before I leave this topic (at least for a while) I thought I would show again the solution that I've used for my own projects. I talked about this before here. For starters, I think we want to use XML, since NCBI may potentially change other formats at will.

You can view a BLAST search as having up to 3 levels of elements:

• multiple sequences that are in the same query file
• multiple hits, sequences in the database that match
• multiple HSPs (high-scoring segment pairs), within each hit

Here is a graphic from the old post which shows this reflected in the structure of an XML file.

An HSP has a query sequence (qseq), hit sequence (hseq) and midline--- '|' characters showing matches or a space for no match. For what follows, I'll consider just the simpler case where only a single query has been input to BLAST.

Using the setup from last time (here and here), I do this from the command line:

~/Software/blast/programs/blast-2.2.22/bin/blastall \
-i ~/Desktop/temp/inseqs.fasta -p blastn -m 7 \
-d ~/Desktop/temp/refseqs.fasta \
-o ~/Desktop/blast_test.xml

Now we want to parse the results in blast_test.xml. The first section is the parsing code, and the second does the printing.

import xml.etree.ElementTree as ET

def parseBLASTIteration(iteration, howmany=3):
    hitL = list()
    for hit in iteration.findall('Iteration_hits/Hit')[:howmany]:
        hitD = dict()
        for k in ['Hit_id','Hit_def','Hit_accession']:
            hitD[k] = hit.findtext(k)
        hitD['hsps'] = list()
        for hsp in hit.findall('Hit_hsps'):
            hspD = dict()
            hspD['score'] =    hsp.findtext('Hsp/Hsp_score')
            hspD['evalue'] =   hsp.findtext('Hsp/Hsp_evalue')
            hspD['identity'] = hsp.findtext('Hsp/Hsp_identity')
            hspD['gaps'] =     hsp.findtext('Hsp/Hsp_gaps')
            hspD['length'] =   hsp.findtext('Hsp/Hsp_align-len')
            hspD['query'] =    hsp.findtext('Hsp/Hsp_qseq')
            hspD['midline'] =  hsp.findtext('Hsp/Hsp_midline')
            hspD['hitseq'] =   hsp.findtext('Hsp/Hsp_hseq')
            identity = int(hspD['identity'])
            length = int(hspD['length'])
            try:
                hspD['%identity'] = identity*100.0/length
            except ZeroDivisionError:
                hspD['%identity'] = 'error'
            hitD['hsps'].append(hspD)
        hitL.append(hitD)
    return hitL

def parseSingleIteration(tree,howmany=3):
    iteration = tree.find('BlastOutput_iterations/Iteration')
    hitL = parseBLASTIteration(iteration,howmany)
    return hitL
#---------------------------------------------------------

def showHitList(hitL,withaccession=True):
    for j,hitD in enumerate(hitL):
        print 'hit #', j+1
        for k in ['Hit_id','Hit_def','Hit_accession']:
            print k, hitD[k]
        if withaccession:
            print hitD['Hit_accession'].ljust(10),
        hspL = hitD['hsps']
        for hspD in hspL[:1]:
            print hspD['identity'] + '/' + hspD['length'],
            print ('%3.2f' % hspD['%identity']).rjust(7)
            printHspAlignment(hspD)
        print
        
def printHspAlignment(hspD):
    line_length = 60
    query = hspD['query']
    midline = hspD['midline']
    hitseq = hspD['hitseq']
    for i in range(0,len(query),line_length):
        print query[i:i+line_length]
        print midline[i:i+line_length]
        print hitseq[i:i+line_length]

fn = 'blast_test.xml'
tree = ET.parse(fn)
hitL = parseSingleIteration(tree)
showHitList(hitL[:3])

Here is the output:

$ python parseBlastXML.py 
hit # 1
Hit_id lcl|s2
Hit_def No definition line found
Hit_accession s2
s2         26/29   89.66
TGCAGCTTGAG-CACAGGTTAGAGCCTTC
||||||||||| ||||||  |||||||||
TGCAGCTTGAGCCACAGGAGAGAGCCTTC

hit # 2
Hit_id lcl|s3
Hit_def No definition line found
Hit_accession s3
s3         12/12  100.00
TGCAGCTTGAGC
||||||||||||
TGCAGCTTGAGC

hit # 3
Hit_id lcl|s1
Hit_def No definition line found
Hit_accession s1
s1         12/12  100.00
TGCAGCTTGAGC
||||||||||||
TGCAGCTTGAGC