Posting has become much rarer here recently, as my energy has flagged.
And this morning I have a rather preemptory request that I fix the old links that were broken as a result of Apple's ditching iDisk. While I still have the files, somewhere, and I could theoretically put them on Dropbox, I'm really not ready to fix what must be several hundred links.
So I thought I would post this warning as what will probably be my last post. It's been fun, and I'm grateful to my readers for having taken an interest. We had more than 50,000 unique visitors and over 150 flags. I just wish someone from Tonga had come by.
If you really must have an old file, I can probably find it.
[ UPDATE: I made a zip of the directory tree which was used to be on iDisk under
http://web.mac.com/telliott99/Python/
It is on Dropbox here and weighs in at 17.8 MB. If you check the old links that don't work, the file should be in that archive. Let me know if you can't find something you'd like.
Thursday, August 9, 2012
Tuesday, July 31, 2012
More fun with geometry
I ran into a fun problem with parallelograms the other day.
The graphic shows a parallelogram broken up into pieces by its diagonals.
Since the opposing sides of a parallelogram are parallel and equal, it's easy to show that the top and bottom triangles are congruent, just rotated, likewise with the other two.
The problem posed was to express the total area in terms of the obtuse angle at the center, the one which is > 90 degrees. Let's call that angle A, and its supplementary angle (the acute angle at the center) B.
The reason why I've drawn the figure in this way is that, as shown below, the triangles can be rearranged to give a different paralellogram with the same area.
x = length of long diagonal (orange + maroon)
y = length of short diagonal
y/2 = 1/2 length of short diagonal (blue)
Area = x (y/2) sin B
The sine values for supplementary angles are equal.
sin (π/2 - θ) = sin θ (ref)
So:
Area = x (y/2) sin A
Nice! Much better than using the law of sines, and for the vectorists out there, expressed even more compactly as the cross-product.
The graphic shows a parallelogram broken up into pieces by its diagonals.
Since the opposing sides of a parallelogram are parallel and equal, it's easy to show that the top and bottom triangles are congruent, just rotated, likewise with the other two.
The problem posed was to express the total area in terms of the obtuse angle at the center, the one which is > 90 degrees. Let's call that angle A, and its supplementary angle (the acute angle at the center) B.
The reason why I've drawn the figure in this way is that, as shown below, the triangles can be rearranged to give a different paralellogram with the same area.
x = length of long diagonal (orange + maroon)
y = length of short diagonal
y/2 = 1/2 length of short diagonal (blue)
Area = x (y/2) sin B
The sine values for supplementary angles are equal.
sin (π/2 - θ) = sin θ (ref)
So:
Area = x (y/2) sin A
Nice! Much better than using the law of sines, and for the vectorists out there, expressed even more compactly as the cross-product.
Saturday, June 16, 2012
variadic arguments
I got curious about functions with variadic arguments like in NSArray:
Here's are two C examples. In either case, it's required to have a named argument that precedes the variable part. In the first method, that variable gives a count of the arguments. In the second, it's not used for anything except to show where to start. A sentinel value shows the end of the variable length argument list. I also incorporated
+ (id)arrayWithObjects:(id)firstObj, ...
Here's are two C examples. In either case, it's required to have a named argument that precedes the variable part. In the first method, that variable gives a count of the arguments. In the second, it's not used for anything except to show where to start. A sentinel value shows the end of the variable length argument list. I also incorporated
__func__
, which I came across in my reading and was new to me.// clang variadic.c -o prog #include <stdio.h> #include <stdarg.h> int f(int count, ...) { printf("%s%s", __func__, "\n"); int n, sum; va_list ap; va_start(ap, count); sum = 0; for ( ; count > 0; count--) { n = va_arg(ap, int); printf("n = %3d\n", n); sum += n; } va_end(ap); return sum; } int g(int x, ...) { printf("%s%s", __func__, "\n"); fflush(stdout); int n, sum; va_list ap; va_start(ap, x); n = x; sum = 0; while (!(0)) { printf("n = %3d\n", n); sum += n; n = va_arg(ap, int); if (n == 0) { break; } } va_end(ap); return sum; } int main(int argc, char * argv[]) { int result = f(3,1,2,3); printf ("sum = %d\n", result); result = g(1,2,3,0); printf ("sum = %d\n", result); return 0; } |
> ./prog f n = 1 n = 2 n = 3 sum = 6 g n = 1 n = 2 n = 3 sum = 6 |
NSString (1)
Test harness:
// clang strings1.m -o prog -framework Foundation -fobjc-gc-only #include <Foundation/Foundation.h> int main(int argc, char * argv[]) { // ..code here.. return 0; } |
NSString *s; NSMutableArray *ma = [NSMutableArray arrayWithCapacity:10]; char *cstr = "abc"; // real C string w/ "\0" also works uint ascii = NSASCIIStringEncoding; uint utf8 = NSUTF8StringEncoding; [ma addObject:[NSString stringWithCString:cstr encoding:ascii]]; [ma addObject:[NSString stringWithCString:cstr encoding:utf8]]; [ma addObject:[NSString stringWithUTF8String:cstr]]; NSMutableString *ms = [[NSMutableString alloc] init]; [ms appendString:@"abde"]; [ms insertString:@"c" atIndex:2]; [ma addObject:[ms copy]]; // otherwise code below will alter [ms appendString:@"fghij"]; [ms replaceOccurrencesOfString:@"B" withString:@"*" options:NSCaseInsensitiveSearch range:NSMakeRange(0,2) ]; [ma addObject:ms]; for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); } [ma removeAllObjects]; printf("-----------------\n"); |
> ./prog 2012-05-29 16:45:03.624 prog[444:707] abc __NSCFString 2012-05-29 16:45:03.626 prog[444:707] abc __NSCFString 2012-05-29 16:45:03.627 prog[444:707] abc __NSCFString 2012-05-29 16:45:03.627 prog[444:707] abcde __NSCFString 2012-05-29 16:45:03.628 prog[444:707] a*cdefghij __NSCFString ----------------- |
[ma addObject:[NSString stringWithFormat:@"abcd"]]; [ma addObject:[NSString stringWithFormat:@"ab%@", @"cd" @"ef"]]; [ma addObject:[NSString stringWithFormat:@"a%C%C", 0x62, 0x63]]; char b[] = { 0x61, 0x62, 0x63, 0x64, 0x65 }; NSData *d = [NSData dataWithBytes:b length:5]; s = [[NSString alloc] initWithData:d encoding:utf8]; [ma addObject:s]; for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); } printf("-----------------\n"); |
2012-05-29 16:45:03.629 prog[444:707] abcd __NSCFString 2012-05-29 16:45:03.629 prog[444:707] abcdef __NSCFString 2012-05-29 16:45:03.630 prog[444:707] abc __NSCFString 2012-05-29 16:45:03.630 prog[444:707] abcde __NSCFString ----------------- |
d = [NSData dataWithBytes:b length:5]; [d writeToFile:@"x" atomically:YES]; NSError *err = nil; NSStringEncoding enc = NSUTF8StringEncoding; s = [NSString stringWithContentsOfFile:@"x" usedEncoding:&enc error:&err]; if (err != nil) { NSLog(@"%@", [err userInfo]); } else { NSLog(@"%@", s ); } b[4] = 0xff; d = [NSData dataWithBytes:b length:5]; [d writeToFile:@"x" atomically:YES]; s = [NSString stringWithContentsOfFile:@"x" usedEncoding:&enc error:&err]; if (err != nil) { NSLog(@"%@", [err localizedDescription]); } else { NSLog(@"%@", s ); } |
2012-05-29 16:45:03.641 prog[444:707] abcde 2012-05-29 16:45:03.658 prog[444:707] The file “x” couldn’t be opened because the text encoding of its contents can’t be determined. |
Thursday, May 31, 2012
Burroughs-Wheeler Transform-BWT (3)
I'm exploring the Burroughs-Wheeler Transform, which has become important for fast methods of matching short reads to large genomes. The first two posts are here and here.
In the first we saw what BWT is and showed how to reverse or invert it. In the second, we saw how to use BWT to do a search. Here, I show a fairly efficient way to generate the BWT for a sequence.
Last time, we used a script which does a rotation on a deque followed by sorting to generate the BWT of a sequence:
and obtained
The result matches what the visualization tool on this page generates. However, our script does a lot more work (and uses more memory) than is really necessary. If we look again at the first few lines of output from bwt.py:
Each number is the index into the sequence where the row (the suffix) begins (0-based indexing). The array of numbers from top to bottom of the output is the suffix array.
As you can see from the last part of the script below, the BWT is easily generated from the suffix array, it's just text[j-1] for each j in the array. That's because the BWT is the last column and the sequences wrap, so each character in the BWT is -1 with respect to the character in the first column for any row.
The Python script shown here sorts the rows but only looks at as many of the characters as are needed to do the sort. We have a list L to contain the growing suffix array. This consists of indexes j into the text, one corresponding to each row, which are in sorted order in the suffix array.
We move an index i along the text and for the suffix starting at each i, find the place where it belongs in the suffix array, then insert it.
I'm sure there are even faster ways to generate the BWT for a text. The most glaring problem with this attempt is that repeated insertion into a list is expensive. We need a data structure that combines ease of insertion with ease of traversal (like a linked list).
Output:
If you compare this with what we got last time you'll see it matches.
fast_bwt.py
In the first we saw what BWT is and showed how to reverse or invert it. In the second, we saw how to use BWT to do a search. Here, I show a fairly efficient way to generate the BWT for a sequence.
Last time, we used a script which does a rotation on a deque followed by sorting to generate the BWT of a sequence:
GAATTCAAGCTTGGATCCGGAAAGATCTGATC
and obtained
GACGAAGGGATTCTGTG^GATACTTAAACTACC
The result matches what the visualization tool on this page generates. However, our script does a lot more work (and uses more memory) than is really necessary. If we look again at the first few lines of output from bwt.py:
20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG 21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC |
Each number is the index into the sequence where the row (the suffix) begins (0-based indexing). The array of numbers from top to bottom of the output is the suffix array.
As you can see from the last part of the script below, the BWT is easily generated from the suffix array, it's just text[j-1] for each j in the array. That's because the BWT is the last column and the sequences wrap, so each character in the BWT is -1 with respect to the character in the first column for any row.
The Python script shown here sorts the rows but only looks at as many of the characters as are needed to do the sort. We have a list L to contain the growing suffix array. This consists of indexes j into the text, one corresponding to each row, which are in sorted order in the suffix array.
We move an index i along the text and for the suffix starting at each i, find the place where it belongs in the suffix array, then insert it.
I'm sure there are even faster ways to generate the BWT for a text. The most glaring problem with this attempt is that repeated insertion into a list is expensive. We need a data structure that combines ease of insertion with ease of traversal (like a linked list).
Output:
> python fast_bwt3.py GACGAAGGGATTCTGTGGATACTTAAACTACC 20 21 6 1 22 7 14 24 29 2 5 16 17 26 9 31 19 13 23 28 8 18 12 4 15 25 30 27 11 3 10 32 |
If you compare this with what we got last time you'll see it matches.
fast_bwt.py
#text = 'BANANA' text = 'GAATTCAAGCTTGGATCCGGAAAGATCTGATC' t = text + '^' N = len(t) # L is the suffix array L = list() # i: index to walk current suffix along text # j: indexes for sorted suffix already in suffix array # k: enumeration of L, to give insertion point def doOne(i,L): for k,j in enumerate(L): n = 0 while t[i+n] == t[j+n]: n += 1 if t[i+n] < t[j+n]: L.insert(k,i) return L.append(i) for i in range(1,N): doOne(i,L) bwt = ''.join([t[j-1] for j in L]) print bwt, '\n' for k,j in enumerate(L): if k and not k % 10: print print '%3d' % j, |
Burroughs-Wheeler Transform-BWT (2)
I'm exploring the Burroughs-Wheeler Transform, which has become very important for fast methods of matching short reads to large genomes. The first post is here.
I found a nice page about it here, which is to be part of a series. There is a great applet(?), anyhow a tool for visualization is on that page, that can walk you through the steps of the search. I found it very helpful for understanding what's going on. I hope my explanation is also useful.
With the output from the Python script from yesterday, we'll try using BWT to search for the query
We'll work from the end of the query backwords. The rows we're working on can be found in the output from the script, which is listed at the end of this post.
Q1 = C.
Begin by selecting all the rows that start with C:
I found a nice page about it here, which is to be part of a series. There is a great applet(?), anyhow a tool for visualization is on that page, that can walk you through the steps of the search. I found it very helpful for understanding what's going on. I hope my explanation is also useful.
With the output from the Python script from yesterday, we'll try using BWT to search for the query
GGATC
in the text:GAATTCAAGCTTGGATCCGGAAAGATCTGATC
We'll work from the end of the query backwords. The rows we're working on can be found in the output from the script, which is listed at the end of this post.
Q1 = C.
Begin by selecting all the rows that start with C: