As the front page says:
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
So the next question is: what is the Burroughs-Wheeler Transform (BWT) and how does it speed up this process? It turns out that the discovery of BWT was not all that long ago (1994 according to wikipedia), and its application to this bioinformatics problem is more recent than that. BWT has been applied to the problem of data compression in the well known bzip tool. We'll talk about the compression part today, and deal with search in another post.
I wrote a simple Python script to explore the BWT; the listing is at the end of the post. The basic transformation couldn't be easier. Take a short text, like
'BANANA', and add a marker to the end, giving
[UPDATE: I chose
'^'because it sorts after the uppercase letters by the default sort. This may grate a little for regex pros who expect
'^'to mark the beginning of something. Just define your own cmp function if you are one of these. ]
Generate all the rotational permutations of this text:
Now sort the list:
That's it! The last column,
'BNN^AAA'is the BWT of
'BANANA'. I think you can see that the runs of identical symbols would allow opportunity for efficient compression.
But here's the main point: by itself, the last column contains all of the information that was in the original text. That is, the transformation is reversible or invertible. The inversion depends on the cyclic rotation.
Here are the steps. Take the BWT and sort it:
By the way, that's what we already have in column 1 (or 0, for you Pythonistas).
Because of the rotation, the BWT plus column 1 gives all pairs of symbols in the original text. The symbol pairs are (BWT first, then column 1):
Now iteratively do this: sort whatever we generated in the previous step, and then tack on the BWT in front. Here are the triplets:
Rinse, lather and repeat:
'BANANA'is a nice simple example to begin with, in some respects it is too simple. For example:
> python bwt.py ACGTGAATTCGAAACCGGAA 11 AAACCGGAA^ACGTGAATTCG 12 AACCGGAA^ACGTGAATTCGA 5 AATTCGAAACCGGAA^ACGTG 18 AA^ACGTGAATTCGAAACCGG 13 ACCGGAA^ACGTGAATTCGAA 0 ACGTGAATTCGAAACCGGAA^ 6 ATTCGAAACCGGAA^ACGTGA 19 A^ACGTGAATTCGAAACCGGA 14 CCGGAA^ACGTGAATTCGAAA 9 CGAAACCGGAA^ACGTGAATT 15 CGGAA^ACGTGAATTCGAAAC 1 CGTGAATTCGAAACCGGAA^A 10 GAAACCGGAA^ACGTGAATTC 4 GAATTCGAAACCGGAA^ACGT 17 GAA^ACGTGAATTCGAAACCG 16 GGAA^ACGTGAATTCGAAACC 2 GTGAATTCGAAACCGGAA^AC 8 TCGAAACCGGAA^ACGTGAAT 3 TGAATTCGAAACCGGAA^ACG 7 TTCGAAACCGGAA^ACGTGAA 20 ^ACGTGAATTCGAAACCGGAA
The BWT (last column) does not group all identical letters together. And in the inversion step, the reconstructed text is not necessarily in the first row.
In the output above, I've also listed the "suffix array", which is the index into the original text where we start each row. The suffix array will become important for the application to search, next time.
import sys from collections import deque try: text = sys.argv except IndexError: text = 'BANANA' text += '^' n = len(text) #--------------------------------------------- # BWT rL = list() dq = deque(text) for i in range(n): rL.append(''.join(list(dq))) dq.rotate(1) rL.sort() bwt = [s[-1] for s in rL] sfx = [(n - e.index('^') - 1) for e in rL] for s,e in zip(sfx,rL): print '%2d' % s, print e #--------------------------------------------- # reverse # take the last col and sort to generate col 0 c0 = sorted(bwt) print '\n'.join(c0) # get pairs as a demo: c1 = [bwt[i]+c0[i] for i in range(n)] c1.sort() print '\n'.join(c1) # iterate starting with bwt again: L = sorted(bwt) for j in range(n-2): L.sort() L = [bwt[i]+L[i] for i in range(n)] print '\n'.join(L) + '\n' L = [e for e in L if not '^' in e] assert ''.join(L) == text[:-1]