Now it's time to do a little testing. I made a sample file, test.txt, with 10,000 sequences. I should really do the timing from the command line (as we discussed here), but to make things easy I wrote a script to run the shell commands and print the time elapsed. The actual commands are:
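In outline, such a wrapper looks something like this; the database name, file names, and program flags below are placeholders, not necessarily what script.py (listed at the end of the post) actually runs:

import subprocess, time

query = 'test.txt'        # the 10,000 sample sequences
db_fasta = 'genome.fa'    # BLAT takes the database as a FASTA file
blast_db = 'genome'       # BLAST database built with makeblastdb

commands = {
    'blat':  ['blat', db_fasta, query, 'out.psl'],
    'blast': ['blastn', '-db', blast_db, '-query', query,
              '-outfmt', '6', '-out', 'out.blast'],
}

for name, cmd in commands.items():
    start = time.time()
    subprocess.run(cmd, check=True)
    print(name, 'elapsed:', round(time.time() - start, 1), 'sec')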
And the output is:
BLAT is certainly faster than BLAST. Time might be the deciding factor if we do this every day, or if we have 2e7 reads. Another important issue is accuracy. We won't address that directly, but we should at least compare the results from the two methods. I wrote a second script to compare the output. The first two lines show that BLAT found matches for 607 sequences that BLAST missed (though the converse is true for a smaller number of sequences).
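In outline, the comparison amounts to collecting the set of query names each program matched and diffing the two sets. This sketch assumes BLAST tabular output and BLAT psl output, with the placeholder file names from the timing step; the real logic is in compare.py, listed at the end of the post:

def blast_hits(fn):
    # tabular output (-outfmt 6): the query id is the first column
    with open(fn) as f:
        return set(line.split('\t')[0] for line in f if line.strip())

def blat_hits(fn):
    # psl output: skip the 5-line header; the query name is column 10
    with open(fn) as f:
        lines = f.readlines()[5:]
    return set(line.split('\t')[9] for line in lines if line.strip())

blast = blast_hits('out.blast')
blat = blat_hits('out.psl')
print('matched by BLAT only: ', len(blat - blast))
print('matched by BLAST only:', len(blast - blat))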
Next we investigate sequences where the length of the alignment differed between the two methods. There are lots of these (922). A common pattern is for BLAST to report the same alignment but trim it at the end of the read (presumably due to a mismatch). That suggests we might profitably trim the ends of all the reads before doing the alignment (say, to 40 nt, which would still be plenty). At least BLAST reported the hit; the trouble is that the shortened alignment trips the quality filter.
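Trimming itself would be easy enough. For instance, assuming the reads are stored one sequence per line in a FASTA file like test.txt:

N = 40    # keep only the first 40 nt of each read
with open('test.txt') as infile, open('test40.txt', 'w') as outfile:
    for line in infile:
        if line.startswith('>'):
            outfile.write(line)               # header lines pass through
        else:
            outfile.write(line.strip()[:N] + '\n')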
There is a rarer pattern where BLAT reports a match that is extremely short (see the fourth and seventh items in the list).
That's a real problem, because we're filtering the alignments to keep only those that are nearly full-length. I looked at the first one; it is a repeat.
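For context, that filter is essentially just a length check along these lines (the 90% cutoff here is only an example, not the exact threshold):

def nearly_full_length(aln_len, read_len, min_frac=0.9):
    # keep an alignment only if it covers (nearly) the whole read
    return aln_len >= min_frac * read_len

print(nearly_full_length(48, 50))   # True:  a near full-length match passes
print(nearly_full_length(15, 50))   # False: a very short BLAT hit gets dropped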
OK...
This one is my fault. It is not the repeat that's causing trouble: BLAT split a single match into two pieces. I saved the results in a dictionary keyed by query name, and the duplicate key caused one of the two pieces to be silently discarded. My parsing code needs to be smarter and look out for this situation.
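In essence the bug, and the obvious fix, look like this (the parsed rows and field names are made up for illustration):

from collections import defaultdict

# toy stand-ins for parsed psl rows (field names are invented)
psl_hits = [
    {'qName': 'read_0001', 'matches': 32},
    {'qName': 'read_0001', 'matches': 14},   # BLAT split this read in two
    {'qName': 'read_0002', 'matches': 49},
]

# buggy version: one dictionary slot per query name, so the second
# piece of read_0001 silently overwrites the first
by_read = {}
for hit in psl_hits:
    by_read[hit['qName']] = hit

# smarter version: keep every alignment for a read, then decide later
by_read = defaultdict(list)
for hit in psl_hits:
    by_read[hit['qName']].append(hit)

split_reads = [q for q, hits in by_read.items() if len(hits) > 1]
print(split_reads)    # ['read_0001']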
Here are the alignments from NCBI's web BLAST, so you can see the mismatch:
script.py
compare.py