Thursday, August 9, 2012

See ya

Posting has become much rarer here recently, as my energy has flagged.

And this morning I have a rather preemptory request that I fix the old links that were broken as a result of Apple's ditching iDisk. While I still have the files, somewhere, and I could theoretically put them on Dropbox, I'm really not ready to fix what must be several hundred links.

So I thought I would post this warning as what will probably be my last post. It's been fun, and I'm grateful to my readers for having taken an interest. We had more than 50,000 unique visitors and over 150 flags. I just wish someone from Tonga had come by.

If you really must have an old file, I can probably find it.

[ UPDATE: I made a zip of the directory tree which was used to be on iDisk under
http://web.mac.com/telliott99/Python/

It is on Dropbox here and weighs in at 17.8 MB. If you check the old links that don't work, the file should be in that archive. Let me know if you can't find something you'd like.

Tuesday, July 31, 2012

More fun with geometry

I ran into a fun problem with parallelograms the other day.


The graphic shows a parallelogram broken up into pieces by its diagonals.

Since the opposing sides of a parallelogram are parallel and equal, it's easy to show that the top and bottom triangles are congruent, just rotated, likewise with the other two.

The problem posed was to express the total area in terms of the obtuse angle at the center, the one which is > 90 degrees. Let's call that angle A, and its supplementary angle (the acute angle at the center) B.

The reason why I've drawn the figure in this way is that, as shown below, the triangles can be rearranged to give a different paralellogram with the same area.



x = length of long diagonal (orange + maroon)
y = length of short diagonal
y/2 = 1/2 length of short diagonal (blue)

Area = x (y/2) sin B

The sine values for supplementary angles are equal.

sin (π/2 - θ) = sin θ (ref)

So:

Area = x (y/2) sin A

Nice! Much better than using the law of sines, and for the vectorists out there, expressed even more compactly as the cross-product.

Saturday, June 16, 2012

variadic arguments

I got curious about functions with variadic arguments like in NSArray:

+ (id)arrayWithObjects:(id)firstObj, ...


Here's are two C examples. In either case, it's required to have a named argument that precedes the variable part. In the first method, that variable gives a count of the arguments. In the second, it's not used for anything except to show where to start. A sentinel value shows the end of the variable length argument list. I also incorporated __func__, which I came across in my reading and was new to me.

// clang variadic.c -o prog
#include <stdio.h>
#include <stdarg.h>

int f(int count, ...)
{
  printf("%s%s", __func__, "\n");
  int n, sum; 
  va_list ap;
  va_start(ap, count);
  sum = 0;
  for ( ; count > 0; count--) {
      n = va_arg(ap, int);
      printf("n = %3d\n", n);
      sum += n;
  }
  va_end(ap);
  return sum;
}

int g(int x, ...)
{
  printf("%s%s", __func__, "\n");
  fflush(stdout);
  int n, sum; 
  va_list ap;
  va_start(ap, x);
  n = x;
  sum = 0;
  while (!(0)) {
      printf("n = %3d\n", n);
      sum += n;
      n = va_arg(ap, int);
      if (n == 0) { break; }
  }
  va_end(ap);
  return sum;
}

int main(int argc, char * argv[]) {
    int result = f(3,1,2,3);
    printf ("sum = %d\n", result);
    result = g(1,2,3,0);
    printf ("sum = %d\n", result);
    return 0;
}

> ./prog
f
n =   1
n =   2
n =   3
sum = 6
g
n =   1
n =   2
n =   3
sum = 6

NSString (1)



Test harness:

// clang strings1.m -o prog -framework Foundation -fobjc-gc-only
#include <Foundation/Foundation.h>

int main(int argc, char * argv[]) {

    // ..code here..

    return 0;
}

NSString *s;
    NSMutableArray *ma = [NSMutableArray arrayWithCapacity:10];

    char *cstr = "abc";   // real C string w/ "\0" also works
    uint ascii = NSASCIIStringEncoding;
    uint utf8 = NSUTF8StringEncoding;
    [ma addObject:[NSString stringWithCString:cstr encoding:ascii]];
    [ma addObject:[NSString stringWithCString:cstr encoding:utf8]];
    [ma addObject:[NSString stringWithUTF8String:cstr]];
    NSMutableString *ms = [[NSMutableString alloc] init];
    [ms appendString:@"abde"];
    [ms insertString:@"c" atIndex:2];
    [ma addObject:[ms copy]];  // otherwise code below will alter
    [ms appendString:@"fghij"];
    [ms replaceOccurrencesOfString:@"B"
                        withString:@"*"
                           options:NSCaseInsensitiveSearch
                             range:NSMakeRange(0,2) ];
    [ma addObject:ms];
    for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); }
    [ma removeAllObjects];
    printf("-----------------\n");



> ./prog
2012-05-29 16:45:03.624 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.626 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.627 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.627 prog[444:707] abcde __NSCFString
2012-05-29 16:45:03.628 prog[444:707] a*cdefghij __NSCFString
-----------------

[ma addObject:[NSString stringWithFormat:@"abcd"]];
    [ma addObject:[NSString stringWithFormat:@"ab%@", @"cd" @"ef"]];
    [ma addObject:[NSString stringWithFormat:@"a%C%C", 0x62, 0x63]];
    char b[] = { 0x61, 0x62, 0x63, 0x64, 0x65 };
    NSData *d = [NSData dataWithBytes:b length:5];
    s = [[NSString alloc] initWithData:d encoding:utf8];
    [ma addObject:s];
    for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); }
    printf("-----------------\n");

2012-05-29 16:45:03.629 prog[444:707] abcd __NSCFString
2012-05-29 16:45:03.629 prog[444:707] abcdef __NSCFString
2012-05-29 16:45:03.630 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.630 prog[444:707] abcde __NSCFString
-----------------

d = [NSData dataWithBytes:b length:5];
    [d writeToFile:@"x" atomically:YES];
    NSError *err = nil;
    NSStringEncoding enc = NSUTF8StringEncoding;
    s = [NSString stringWithContentsOfFile:@"x" 
                              usedEncoding:&enc
                                     error:&err];
    if (err != nil) { NSLog(@"%@", [err userInfo]); }
    else {  NSLog(@"%@", s );  }

    b[4] = 0xff;
    d = [NSData dataWithBytes:b length:5];
    [d writeToFile:@"x" atomically:YES];
    s = [NSString stringWithContentsOfFile:@"x" 
                              usedEncoding:&enc 
                                     error:&err];
    if (err != nil) { NSLog(@"%@", [err localizedDescription]); } 
    else {  NSLog(@"%@", s );  }

2012-05-29 16:45:03.641 prog[444:707] abcde
2012-05-29 16:45:03.658 prog[444:707] The file “x” couldn’t be opened because the text encoding of its contents can’t be determined.


Thursday, May 31, 2012

Burroughs-Wheeler Transform-BWT (3)

I'm exploring the Burroughs-Wheeler Transform, which has become important for fast methods of matching short reads to large genomes. The first two posts are here and here.

In the first we saw what BWT is and showed how to reverse or invert it. In the second, we saw how to use BWT to do a search. Here, I show a fairly efficient way to generate the BWT for a sequence.

Last time, we used a script which does a rotation on a deque followed by sorting to generate the BWT of a sequence:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

and obtained

GACGAAGGGATTCTGTG^GATACTTAAACTACC

The result matches what the visualization tool on this page generates. However, our script does a lot more work (and uses more memory) than is really necessary. If we look again at the first few lines of output from bwt.py:

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC

Each number is the index into the sequence where the row (the suffix) begins (0-based indexing). The array of numbers from top to bottom of the output is the suffix array.

As you can see from the last part of the script below, the BWT is easily generated from the suffix array, it's just text[j-1] for each j in the array. That's because the BWT is the last column and the sequences wrap, so each character in the BWT is -1 with respect to the character in the first column for any row.

The Python script shown here sorts the rows but only looks at as many of the characters as are needed to do the sort. We have a list L to contain the growing suffix array. This consists of indexes j into the text, one corresponding to each row, which are in sorted order in the suffix array.

We move an index i along the text and for the suffix starting at each i, find the place where it belongs in the suffix array, then insert it.

I'm sure there are even faster ways to generate the BWT for a text. The most glaring problem with this attempt is that repeated insertion into a list is expensive. We need a data structure that combines ease of insertion with ease of traversal (like a linked list).

Output:

> python fast_bwt3.py 
GACGAAGGGATTCTGTGGATACTTAAACTACC 

 20  21   6   1  22   7  14  24  29   2
  5  16  17  26   9  31  19  13  23  28
  8  18  12   4  15  25  30  27  11   3
 10  32

If you compare this with what we got last time you'll see it matches.

fast_bwt.py

#text = 'BANANA'
text = 'GAATTCAAGCTTGGATCCGGAAAGATCTGATC' 
t = text + '^'
N = len(t)

# L is the suffix array
L = list()

# i: index to walk current suffix along text
# j: indexes for sorted suffix already in suffix array
# k: enumeration of L, to give insertion point

def doOne(i,L):
    for k,j in enumerate(L):
        n = 0
        while t[i+n] == t[j+n]:
            n += 1
        if t[i+n] < t[j+n]:
            L.insert(k,i)
            return
    L.append(i)

for i in range(1,N):
    doOne(i,L)

bwt = ''.join([t[j-1] for j in L])
print bwt, '\n'
for k,j in enumerate(L):
    if k and not k % 10:
        print
    print '%3d' % j,

Burroughs-Wheeler Transform-BWT (2)

I'm exploring the Burroughs-Wheeler Transform, which has become very important for fast methods of matching short reads to large genomes. The first post is here.

I found a nice page about it here, which is to be part of a series. There is a great applet(?), anyhow a tool for visualization is on that page, that can walk you through the steps of the search. I found it very helpful for understanding what's going on. I hope my explanation is also useful.

With the output from the Python script from yesterday, we'll try using BWT to search for the query GGATC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

We'll work from the end of the query backwords. The rows we're working on can be found in the output from the script, which is listed at the end of this post.

Q1 = C.
Begin by selecting all the rows that start with C:

 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT

Q2 = T
So next, restricted to this selection,
find all the positions in the BWT (last column) that contain T.
The first one is

5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT

Within the BWT as a whole, this is the first T.
There are four T's in the BWT in the range selected.
So our range is now 1-4.

Next, go to the rows starting with T,
that are of rank 1-4 with respect to T:

 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA

Q3 = A.
Find all the rows with A in the BWT for this range.
The first one is:

15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA

Within the BWT as a whole, this is the seventh A.
There are three A's in the BWT in the range selected.
So our range is now 7-9.

Next, go to the rows starting with A,
that are of rank 7-9 with respect to A:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG

Q4 = G.

Find all the rows with G in the BWT for this range.
That would be all of them.

The first one is:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG

Within the BWT as a whole, this is the 3rd G.
So our range is now 3-5.

Finally, go to the rows starting with G,
that are of rank 3-5 with respect to G:

13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT

Q5 is G.
There is one row with a match.
Since is the last letter of the query, we're done.

Our match has index 13.

That's the position of the query GATCC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC
             + 

using zero-based indexing.

Notice that we don't use the whole table, and we don't use the suffix array (the indexes) until the end. Instead, all we use is the BWT and the sorted characters (actually, the index of the first A, C, etc.):

txt GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

One more thing. If we have a selected range in the BWT and are finding say, all the G's from index 6-10, we need to find quickly how many G's come before that position in the BWT.

Not using the whole table is important, because the BWT has to be memory-efficient. It would not be useful if we required memory proportional to the square of the genome size.

I think it might be worth it to work the whole example again, using just the BWT and the sorted characters:

[ NOTE: Blogger keeps screwing with the formatting. Please let me know if it doesn't seems correct later on. ]

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

Q1 = C
Q2 = T
          +
GACGAAGGGATTCTGTG^GATACTTAAACTACC
AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ------

The indicated T is T1 in the BWT
R = 1-4

Go to T1-T4

Q3 = A

     +  ++   +         + +   +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                            ----

The indicated A is A7 in the BWT
R = 7-9

Go to A7-A9

Q4 = G

    +  +  +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ---

The indicated G is G3 in the BWT
R = 3-5

Go to G3-G5

Q5 = G

    +  +  +++     + + +   
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                      ---

The indicated G is G8 in the BWT
We're done.

The index (see below) is 13 for the match.

output from bwt.py
seq
   GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC
 1 AATTCAAGCTTGGATCCGGAAAGATCTGATC^G
22 AGATCTGATC^GAATTCAAGCTTGGATCCGGAA
 7 AGCTTGGATCCGGAAAGATCTGATC^GAATTCA
14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG
 2 ATTCAAGCTTGGATCCGGAAAGATCTGATC^GA
 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT
19 GAAAGATCTGATC^GAATTCAAGCTTGGATCCG
 0 GAATTCAAGCTTGGATCCGGAAAGATCTGATC^
13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT
 8 GCTTGGATCCGGAAAGATCTGATC^GAATTCAA
18 GGAAAGATCTGATC^GAATTCAAGCTTGGATCC
12 GGATCCGGAAAGATCTGATC^GAATTCAAGCTT
 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA
27 TGATC^GAATTCAAGCTTGGATCCGGAAAGATC
11 TGGATCCGGAAAGATCTGATC^GAATTCAAGCT
 3 TTCAAGCTTGGATCCGGAAAGATCTGATC^GAA
10 TTGGATCCGGAAAGATCTGATC^GAATTCAAGC
32 ^GAATTCAAGCTTGGATCCGGAAAGATCTGATC

Wednesday, May 30, 2012

Burroughs-Wheeler Transform-BWT (1)

Some time ago I was introduced to bowtie (see this post).

As the front page says:

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).


So the next question is: what is the Burroughs-Wheeler Transform (BWT) and how does it speed up this process? It turns out that the discovery of BWT was not all that long ago (1994 according to wikipedia), and its application to this bioinformatics problem is more recent than that. BWT has been applied to the problem of data compression in the well known bzip tool. We'll talk about the compression part today, and deal with search in another post.

I wrote a simple Python script to explore the BWT; the listing is at the end of the post. The basic transformation couldn't be easier. Take a short text, like 'BANANA', and add a marker to the end, giving 'BANANA^'.

[UPDATE: I chose '^' because it sorts after the uppercase letters by the default sort. This may grate a little for regex pros who expect '^' to mark the beginning of something. Just define your own cmp function if you are one of these. ]

Generate all the rotational permutations of this text:

BANANA^
^BANANA
A^BANAN
NA^BANA
ANA^BAN
NANA^BA
ANANA^B


Now sort the list:

ANANA^B
ANA^BAN
A^BANAN
BANANA^
NANA^BA
NA^BANA
^BANANA


That's it! The last column, 'BNN^AAA' is the BWT of 'BANANA'. I think you can see that the runs of identical symbols would allow opportunity for efficient compression.

But here's the main point: by itself, the last column contains all of the information that was in the original text. That is, the transformation is reversible or invertible. The inversion depends on the cyclic rotation.

Here are the steps. Take the BWT and sort it:

A
A
A
B
N
N
^


By the way, that's what we already have in column 1 (or 0, for you Pythonistas).

Because of the rotation, the BWT plus column 1 gives all pairs of symbols in the original text. The symbol pairs are (BWT first, then column 1):

BA
NA
NA
^B
AN
AN
A^


Now iteratively do this: sort whatever we generated in the previous step, and then tack on the BWT in front. Here are the triplets:

BAN
NAN
NA^
^BA
ANA
ANA
A^B


Rinse, lather and repeat:

BANA
NANA
NA^B
^BAN
ANAN
ANA^
A^BA

BANAN
NANA^
NA^BA
^BANA
ANANA
ANA^B
A^BAN

BANANA
NANA^B
NA^BAN
^BANAN
ANANA^
ANA^BA
A^BANA


Although 'BANANA' is a nice simple example to begin with, in some respects it is too simple. For example:

> python bwt.py ACGTGAATTCGAAACCGGAA
11 AAACCGGAA^ACGTGAATTCG
12 AACCGGAA^ACGTGAATTCGA
 5 AATTCGAAACCGGAA^ACGTG
18 AA^ACGTGAATTCGAAACCGG
13 ACCGGAA^ACGTGAATTCGAA
 0 ACGTGAATTCGAAACCGGAA^
 6 ATTCGAAACCGGAA^ACGTGA
19 A^ACGTGAATTCGAAACCGGA
14 CCGGAA^ACGTGAATTCGAAA
 9 CGAAACCGGAA^ACGTGAATT
15 CGGAA^ACGTGAATTCGAAAC
 1 CGTGAATTCGAAACCGGAA^A
10 GAAACCGGAA^ACGTGAATTC
 4 GAATTCGAAACCGGAA^ACGT
17 GAA^ACGTGAATTCGAAACCG
16 GGAA^ACGTGAATTCGAAACC
 2 GTGAATTCGAAACCGGAA^AC
 8 TCGAAACCGGAA^ACGTGAAT
 3 TGAATTCGAAACCGGAA^ACG
 7 TTCGAAACCGGAA^ACGTGAA
20 ^ACGTGAATTCGAAACCGGAA

The BWT (last column) does not group all identical letters together. And in the inversion step, the reconstructed text is not necessarily in the first row.

In the output above, I've also listed the "suffix array", which is the index into the original text where we start each row. The suffix array will become important for the application to search, next time.

bwt.py
import sys
from collections import deque

try: 
    text = sys.argv[1]
except IndexError:
    text = 'BANANA'

text += '^'
n = len(text)
#---------------------------------------------
# BWT

rL = list()
dq = deque(text)
for i in range(n):
    rL.append(''.join(list(dq)))
    dq.rotate(1)
rL.sort()

bwt = [s[-1] for s in rL]
sfx = [(n - e.index('^') - 1) for e in rL]

for s,e in zip(sfx,rL):
    print '%2d' % s,
    print e
#---------------------------------------------
# reverse
# take the last col and sort to generate col 0

c0 = sorted(bwt)
print '\n'.join(c0)

# get pairs as a demo:
c1 = [bwt[i]+c0[i] for i in range(n)]
c1.sort()
print '\n'.join(c1)

# iterate starting with bwt again:
L = sorted(bwt)
for j in range(n-2):
    L.sort()
    L = [bwt[i]+L[i] for i in range(n)]
    print '\n'.join(L) + '\n'
    
L = [e for e in L if not '^' in e]
assert ''.join(L[0]) == text[:-1]