Wednesday, May 7, 2008

Python for BLAST XML

I think it was probably the Biopython tutorial that first made me aware of NCBI's policy: use XML (or ASN.1) if you want us not to break your code. They change HTML and plain text output at will in order to improve the experience for a human reading it, and there is no guarantee when things might change again. (I couldn't find a page at NCBI to cite in a quick search but I remember seeing this disclaimer). Scripted interaction with NCBI is supposed to use eutils, many of the eutils (like ESearch) return only XML, while others (like EFetch) respond to rettype=fasta or rettype=text. [This is the executive summary, using WebEnv one can gain more control over the output, but that is another post.] NCBI makes extensive use of another format called ASN.1 but have surrendered to the popularity of XML for data exchange over the web.

XML can be a complicated beast, however NCBI XML documents seem pretty regular. If we have saved BLAST results in XML format, how can we parse the result? One way is to use Biopython. There is a good description of how to do this in the Biopython tutorial (see the link above). This is what I have been doing. However, there is a really lightweight and simple alternative. As of version 2.5, stock Python comes with at least part of the ElementTree library. Additional explanation may be found here, here, and here.

I stripped out most of a BLAST results XML file to give a simplified overview. It's a hierarchy of open and close tags. The part of the hierarchy we are interested in has data like this:


There is a separate Iteration for each FASTA formatted sequence in the input--i.e. each separate BLAST query. There is a Hit for each sequence in the database that shows a match. And any individual Hit may have more than one section that matches, these are called Hsp for high-scoring segment pair. We navigate down the tree until we find what we're looking for. We can jump levels: e.g. given a hit we do

for hsp in hit.findall('Hit_hsps'):
    identities = hsp.findtext('Hsp/Hsp_identity')


Here is a script to parse an example file (remember to change the file extension if you want to run it). And this is what it prints:

blastn
gi|74145426|gb|DQ174269.1|
Achromobacter xylosoxidans 16S ribosomal RNA gene, partial sequence
DQ174269
id: 476 len: 476 %id: 100.0
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG

gi|15384334|gb|AF411020.1|
Achromobacter xylosoxidans strain AU1011 16S ribosomal RNA gene, partial sequence
AF411020
id: 476 len: 476 %id: 100.0
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG

I'm sure you can take it from there. Here is a page I wrote a while back about parsing Medline using ElementTree. It has added eye candy.