Thursday, May 31, 2012

Burroughs-Wheeler Transform-BWT (3)

I'm exploring the Burroughs-Wheeler Transform, which has become important for fast methods of matching short reads to large genomes. The first two posts are here and here.

In the first we saw what BWT is and showed how to reverse or invert it. In the second, we saw how to use BWT to do a search. Here, I show a fairly efficient way to generate the BWT for a sequence.

Last time, we used a script which does a rotation on a deque followed by sorting to generate the BWT of a sequence:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

and obtained

GACGAAGGGATTCTGTG^GATACTTAAACTACC

The result matches what the visualization tool on this page generates. However, our script does a lot more work (and uses more memory) than is really necessary. If we look again at the first few lines of output from bwt.py:

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC

Each number is the index into the sequence where the row (the suffix) begins (0-based indexing). The array of numbers from top to bottom of the output is the suffix array.

As you can see from the last part of the script below, the BWT is easily generated from the suffix array, it's just text[j-1] for each j in the array. That's because the BWT is the last column and the sequences wrap, so each character in the BWT is -1 with respect to the character in the first column for any row.

The Python script shown here sorts the rows but only looks at as many of the characters as are needed to do the sort. We have a list L to contain the growing suffix array. This consists of indexes j into the text, one corresponding to each row, which are in sorted order in the suffix array.

We move an index i along the text and for the suffix starting at each i, find the place where it belongs in the suffix array, then insert it.

I'm sure there are even faster ways to generate the BWT for a text. The most glaring problem with this attempt is that repeated insertion into a list is expensive. We need a data structure that combines ease of insertion with ease of traversal (like a linked list).

Output:

> python fast_bwt3.py 
GACGAAGGGATTCTGTGGATACTTAAACTACC 

 20  21   6   1  22   7  14  24  29   2
  5  16  17  26   9  31  19  13  23  28
  8  18  12   4  15  25  30  27  11   3
 10  32

If you compare this with what we got last time you'll see it matches.

fast_bwt.py

#text = 'BANANA'
text = 'GAATTCAAGCTTGGATCCGGAAAGATCTGATC' 
t = text + '^'
N = len(t)

# L is the suffix array
L = list()

# i: index to walk current suffix along text
# j: indexes for sorted suffix already in suffix array
# k: enumeration of L, to give insertion point

def doOne(i,L):
    for k,j in enumerate(L):
        n = 0
        while t[i+n] == t[j+n]:
            n += 1
        if t[i+n] < t[j+n]:
            L.insert(k,i)
            return
    L.append(i)

for i in range(1,N):
    doOne(i,L)

bwt = ''.join([t[j-1] for j in L])
print bwt, '\n'
for k,j in enumerate(L):
    if k and not k % 10:
        print
    print '%3d' % j,

1 comment:

Will said...

It seems I am embarking on much the same quest as you!

http://williamedwardscoder.tumblr.com/post/24560795584/searching-for-fuzzy-substrings-in-a-massive-string

Loving your series.

You may like libdivsufsort http://code.google.com/p/libdivsufsort/