Python for Bioinformatics: The need for speed: BLAT

I started testing alternatives to BLAST for aligning Illumina reads to a genomic sequence (Haemophilus influenzae L42023.1). When first setting up the project, I should have remembered how much faster BLAT is than BLAST. But I only needed to run the search once. I just went and made myself a sandwich.

Now it's time to do a little testing. I made a sample file test.txt with 10,000 sequences. I should really do command line testing (as we discussed here), but to make things easy I wrote a script to run the shell commands and print the time elapsed.

The actual commands are:

blat Hinf.fna test.txt blat.results.txt -out=blast8

megablast -d Hinf.fna -i test.txt -e 10 -m 8  > blast.result

And the output is:

blat
Loaded 1830138 letters in 1 sequences
Searched 529941 bases in 10000 sequences
1.1
megablast
16.9

BLAT is certainly faster than BLAST. Time might be the deciding factor if we do this every day, or if we have 2e7 reads. Another important issue is accuracy. We won't address that directly, but we should at least compare the results from the two methods. I wrote a second script to compare the output. The first two lines show that BLAT found matches for 607 sequences that BLAST missed (though the converse is true for a smaller number of sequences).

We investigate sequences where the length of the alignment was different. There are lots of these (922). A common pattern is for BLAST to report the same alignment, but trim it at the end of the read (presumably due to a mismatch). That suggests we might profitably trim the ends of all the reads before doing the alignment (say, to 40 nt---it would still be plenty). At least, BLAST reported the hit. But it's messing up the quality filter.

There is a rarer pattern where BLAT reports a match that is extremely short (see the fourth and seventh items in the list).

> python compare.py 
blat  8670
blast 8099
blat  only:   607
blast only:    36
351739
blast (1818932, 1818984) 53
blat  (626892, 626840) 53
295852
blast (937875, 937843) 33
blat  (937895, 937843) 53
514095
blast (545851, 545900) 50
blat  (545851, 545903) 53
514912
blast (677240, 677292) 53
blat  (677542, 677547) 6
518007
blast (580488, 580540) 53
blat  (580493, 580540) 48
517931
blast (493879, 493929) 51
blat  (493879, 493931) 53
462351
blast (46058, 46110) 53
blat  (46152, 46157) 6

That's a real problem because we're filtering the alignments for nearly full-length. I looked at the first one, it is a repeat.

BLAST:

514912 Hinf 100.00 53 0 0 1 53 760633 760685 1e-24  105
514912 Hinf 98.11 53 1 0 1 53 677240 677292 3e-22 97.6

BLAT

514912 Hinf 100.00 53 0 0 1 53 760633 760685 7.6e-22 102.0
514912 Hinf 100.00 47 0 0 1 47 677240 677286 2.3e-18 90.0
514912 Hinf 100.00 6 0 0 48 53 677542 677547 9.9e+05 12.0

OK..
This one is my fault. It is not the repeat that's causing trouble. BLAT split a single match into two pieces. I saved the results in a dictionary, and the duplicated key caused the other result to be discarded. My parsing code needs to be smarter and look out for this situation.

Here are the alignments from NCBI's web BLAST, so you can see the mismatch:

Query  1       TAGGAAAGATAATAATCTTTATCTATCAGATAATCTCGAGCATCTTTTGTTTC  53
               |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  760633  TAGGAAAGATAATAATCTTTATCTATCAGATAATCTCGAGCATCTTTTGTTTC  760685


Query  1       TAGGAAAGATAATAATCTTTATCTATCAGATAATCTCGAGCATCTTTTGTTTC  53
               ||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct  677240  TAGGAAAGATAATAATCTTTATCTATCAGATAATCTCGAGCATCTTTAGTTTC  677292

script.py

import os, time, subprocess
L = ['blat','Hinf.fna','test.txt',
      'blat.results.txt','-out=blast8']
cmd = ' '.join(L)

print L[0]
start = time.time()
p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(p.pid, 0)
print round(time.time() - start,1)

L = ['megablast','-d Hinf.fna','-i test.txt',
     '-e 10','-m 8',' > blast.results.txt']
cmd = ' '.join(L)

start = time.time()
print L[0]
p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(p.pid, 0)
print round(time.time() - start,1)

compare.py

from utils import load_data

def make_dict(fn):
    N = 51
    data = load_data(fn)
    data = data.strip().split('\n')
    D = dict()
    for item in data:
        L = item.strip().split()
        title = L[0]
        align = L[3]
        if align < N:  continue
        i,j = L[8:10]
        D[title] = (int(i),int(j))
    return D
        
blast_dict = make_dict('blast.results.txt')
blat_dict = make_dict('blat.results.txt')
print 'blat ', len(blat_dict)
print 'blast', len(blast_dict)

blat_only = [k for k in blat_dict if not k in blast_dict]
print 'blat  only:  ', len(blat_only)

blast_only = [k for k in blast_dict if not k in blat_dict]
print 'blast only:   ', len(blast_only)

L = set(blat_dict.keys()).intersection(set(blast_dict.keys()))
L = list(L)

count = 0
for e in L:
    if count > 6:  break
    if not blast_dict[e] == blat_dict[e]:
        print e
        t = blast_dict[e]
        print 'blast', t, abs(t[0]-t[1]) + 1
        t = blat_dict[e]
        print 'blat ', t, abs(t[0]-t[1]) + 1
        count += 1
#print count

Saturday, March 5, 2011

The need for speed: BLAT