Thursday, May 31, 2012

Burroughs-Wheeler Transform-BWT (2)

I'm exploring the Burroughs-Wheeler Transform, which has become very important for fast methods of matching short reads to large genomes. The first post is here.

I found a nice page about it here, which is to be part of a series. There is a great applet(?), anyhow a tool for visualization is on that page, that can walk you through the steps of the search. I found it very helpful for understanding what's going on. I hope my explanation is also useful.

With the output from the Python script from yesterday, we'll try using BWT to search for the query GGATC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

We'll work from the end of the query backwords. The rows we're working on can be found in the output from the script, which is listed at the end of this post.

Q1 = C.
Begin by selecting all the rows that start with C:

 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT

Q2 = T
So next, restricted to this selection,
find all the positions in the BWT (last column) that contain T.
The first one is

5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT

Within the BWT as a whole, this is the first T.
There are four T's in the BWT in the range selected.
So our range is now 1-4.

Next, go to the rows starting with T,
that are of rank 1-4 with respect to T:

 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA

Q3 = A.
Find all the rows with A in the BWT for this range.
The first one is:

15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA

Within the BWT as a whole, this is the seventh A.
There are three A's in the BWT in the range selected.
So our range is now 7-9.

Next, go to the rows starting with A,
that are of rank 7-9 with respect to A:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG

Q4 = G.

Find all the rows with G in the BWT for this range.
That would be all of them.

The first one is:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG

Within the BWT as a whole, this is the 3rd G.
So our range is now 3-5.

Finally, go to the rows starting with G,
that are of rank 3-5 with respect to G:

13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT

Q5 is G.
There is one row with a match.
Since is the last letter of the query, we're done.

Our match has index 13.

That's the position of the query GATCC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC
             + 

using zero-based indexing.

Notice that we don't use the whole table, and we don't use the suffix array (the indexes) until the end. Instead, all we use is the BWT and the sorted characters (actually, the index of the first A, C, etc.):

txt GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

One more thing. If we have a selected range in the BWT and are finding say, all the G's from index 6-10, we need to find quickly how many G's come before that position in the BWT.

Not using the whole table is important, because the BWT has to be memory-efficient. It would not be useful if we required memory proportional to the square of the genome size.

I think it might be worth it to work the whole example again, using just the BWT and the sorted characters:

[ NOTE: Blogger keeps screwing with the formatting. Please let me know if it doesn't seems correct later on. ]

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

Q1 = C
Q2 = T
          +
GACGAAGGGATTCTGTG^GATACTTAAACTACC
AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ------

The indicated T is T1 in the BWT
R = 1-4

Go to T1-T4

Q3 = A

     +  ++   +         + +   +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                            ----

The indicated A is A7 in the BWT
R = 7-9

Go to A7-A9

Q4 = G

    +  +  +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ---

The indicated G is G3 in the BWT
R = 3-5

Go to G3-G5

Q5 = G

    +  +  +++     + + +   
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                      ---

The indicated G is G8 in the BWT
We're done.

The index (see below) is 13 for the match.

output from bwt.py
seq
   GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC
 1 AATTCAAGCTTGGATCCGGAAAGATCTGATC^G
22 AGATCTGATC^GAATTCAAGCTTGGATCCGGAA
 7 AGCTTGGATCCGGAAAGATCTGATC^GAATTCA
14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG
 2 ATTCAAGCTTGGATCCGGAAAGATCTGATC^GA
 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT
19 GAAAGATCTGATC^GAATTCAAGCTTGGATCCG
 0 GAATTCAAGCTTGGATCCGGAAAGATCTGATC^
13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT
 8 GCTTGGATCCGGAAAGATCTGATC^GAATTCAA
18 GGAAAGATCTGATC^GAATTCAAGCTTGGATCC
12 GGATCCGGAAAGATCTGATC^GAATTCAAGCTT
 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA
27 TGATC^GAATTCAAGCTTGGATCCGGAAAGATC
11 TGGATCCGGAAAGATCTGATC^GAATTCAAGCT
 3 TTCAAGCTTGGATCCGGAAAGATCTGATC^GAA
10 TTGGATCCGGAAAGATCTGATC^GAATTCAAGC
32 ^GAATTCAAGCTTGGATCCGGAAAGATCTGATC