Friday, May 9, 2008

Biopython for BLAST

I think Biopython is very nice. My only complaints are small ones (and since it's open source, developed and maintained by people working on their own time, complaints are not really fair or appreciated). Still, I sometimes compare the situation with R. R has a beautiful GUI with a "Package Manager" which will automatically download and install not only a package (similar to Biopython) but all of its dependencies as well. Instead, with Biopython I have to go out and load the dependencies for myself (and make sure I get the correct version, which is not always the latest one). In my half-dozen or so installations of Biopython, sometimes this has been easy, and sometimes not. Still, I am *very* grateful on the whole.

I've used Biopython to BLAST a lot, importing NCBIWWW from Bio.Blast. It looks well-maintained and says that it supports all of the extensive API advertised by NCBI for BLAST.

I wanted to understand how it works, so I decided to dissect it. The first thing I did was to make a copy of NCBIWWW.py on my Desktop and put in some print statements to look at various values. After playing with that for a while, I copied the key function and started pruning until I had only the code required for it to work. My minimal example is here.

Actually, it is not completely minimal because it still uses urllib.urlencode(), and urllib2.Request() rather than just calling urllib2.open(url) as we did for fetching FASTA sequences from Genbank. The explanation of how these work promises to be here but I haven't looked at that much yet.

In any case, the Biopython code follows the sequence designed by NCBI where we first send them our request called 'PUT', then they send us back an 'RID'. In the second phase, we poll them periodically with a 'GET' request with the 'RID' value. If the search has not been completed, the result we get is actually an html page and 'Status=waiting' is present in the page source. If we have specified FORMAT_TYPE=XML and the search has finished, then the result is actually the final xml page, and 'Status=' is not found. If we have not specified XML, then we get html. If saved to disk, it won't have the images (why not?), but it will display in a browser.

The BLAST interface used by Biopython is apparently the standard Blast.cgi you would contact with a browser, and the results are *not* XML by default, as I had thought, they are HTML. I got confused somehow a long time ago; as I remember, I thought that you are supposed to set the -m 7 flag for the standalone BLAST program to get XML output, which was explained as ALIGNMENT_VIEW. But an ALIGNMENT_VIEW is different. And looking at the docs for standalone BLAST, -m is not even listed as valid input. This is likely to remain a mystery---another senior moment.

The minimal requirements are pretty minimal. For the first part we need the database, program, and sequence. For the second part we need the RID, and it helps to tell them we want XML. That's it.