Wednesday, May 30, 2012

Burroughs-Wheeler Transform-BWT (1)

Some time ago I was introduced to bowtie (see this post).

As the front page says:

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).


So the next question is: what is the Burroughs-Wheeler Transform (BWT) and how does it speed up this process? It turns out that the discovery of BWT was not all that long ago (1994 according to wikipedia), and its application to this bioinformatics problem is more recent than that. BWT has been applied to the problem of data compression in the well known bzip tool. We'll talk about the compression part today, and deal with search in another post.

I wrote a simple Python script to explore the BWT; the listing is at the end of the post. The basic transformation couldn't be easier. Take a short text, like 'BANANA', and add a marker to the end, giving 'BANANA^'.

[UPDATE: I chose '^' because it sorts after the uppercase letters by the default sort. This may grate a little for regex pros who expect '^' to mark the beginning of something. Just define your own cmp function if you are one of these. ]

Generate all the rotational permutations of this text:

BANANA^
^BANANA
A^BANAN
NA^BANA
ANA^BAN
NANA^BA
ANANA^B


Now sort the list:

ANANA^B
ANA^BAN
A^BANAN
BANANA^
NANA^BA
NA^BANA
^BANANA


That's it! The last column, 'BNN^AAA' is the BWT of 'BANANA'. I think you can see that the runs of identical symbols would allow opportunity for efficient compression.

But here's the main point: by itself, the last column contains all of the information that was in the original text. That is, the transformation is reversible or invertible. The inversion depends on the cyclic rotation.

Here are the steps. Take the BWT and sort it:

A
A
A
B
N
N
^


By the way, that's what we already have in column 1 (or 0, for you Pythonistas).

Because of the rotation, the BWT plus column 1 gives all pairs of symbols in the original text. The symbol pairs are (BWT first, then column 1):

BA
NA
NA
^B
AN
AN
A^


Now iteratively do this: sort whatever we generated in the previous step, and then tack on the BWT in front. Here are the triplets:

BAN
NAN
NA^
^BA
ANA
ANA
A^B


Rinse, lather and repeat:

BANA
NANA
NA^B
^BAN
ANAN
ANA^
A^BA

BANAN
NANA^
NA^BA
^BANA
ANANA
ANA^B
A^BAN

BANANA
NANA^B
NA^BAN
^BANAN
ANANA^
ANA^BA
A^BANA


Although 'BANANA' is a nice simple example to begin with, in some respects it is too simple. For example:

> python bwt.py ACGTGAATTCGAAACCGGAA
11 AAACCGGAA^ACGTGAATTCG
12 AACCGGAA^ACGTGAATTCGA
 5 AATTCGAAACCGGAA^ACGTG
18 AA^ACGTGAATTCGAAACCGG
13 ACCGGAA^ACGTGAATTCGAA
 0 ACGTGAATTCGAAACCGGAA^
 6 ATTCGAAACCGGAA^ACGTGA
19 A^ACGTGAATTCGAAACCGGA
14 CCGGAA^ACGTGAATTCGAAA
 9 CGAAACCGGAA^ACGTGAATT
15 CGGAA^ACGTGAATTCGAAAC
 1 CGTGAATTCGAAACCGGAA^A
10 GAAACCGGAA^ACGTGAATTC
 4 GAATTCGAAACCGGAA^ACGT
17 GAA^ACGTGAATTCGAAACCG
16 GGAA^ACGTGAATTCGAAACC
 2 GTGAATTCGAAACCGGAA^AC
 8 TCGAAACCGGAA^ACGTGAAT
 3 TGAATTCGAAACCGGAA^ACG
 7 TTCGAAACCGGAA^ACGTGAA
20 ^ACGTGAATTCGAAACCGGAA

The BWT (last column) does not group all identical letters together. And in the inversion step, the reconstructed text is not necessarily in the first row.

In the output above, I've also listed the "suffix array", which is the index into the original text where we start each row. The suffix array will become important for the application to search, next time.

bwt.py
import sys
from collections import deque

try: 
    text = sys.argv[1]
except IndexError:
    text = 'BANANA'

text += '^'
n = len(text)
#---------------------------------------------
# BWT

rL = list()
dq = deque(text)
for i in range(n):
    rL.append(''.join(list(dq)))
    dq.rotate(1)
rL.sort()

bwt = [s[-1] for s in rL]
sfx = [(n - e.index('^') - 1) for e in rL]

for s,e in zip(sfx,rL):
    print '%2d' % s,
    print e
#---------------------------------------------
# reverse
# take the last col and sort to generate col 0

c0 = sorted(bwt)
print '\n'.join(c0)

# get pairs as a demo:
c1 = [bwt[i]+c0[i] for i in range(n)]
c1.sort()
print '\n'.join(c1)

# iterate starting with bwt again:
L = sorted(bwt)
for j in range(n-2):
    L.sort()
    L = [bwt[i]+L[i] for i in range(n)]
    print '\n'.join(L) + '\n'
    
L = [e for e in L if not '^' in e]
assert ''.join(L[0]) == text[:-1]