Python for Bioinformatics: 2012

Thursday, August 9, 2012

See ya

Posting has become much rarer here recently, as my energy has flagged.

And this morning I have a rather preemptory request that I fix the old links that were broken as a result of Apple's ditching iDisk. While I still have the files, somewhere, and I could theoretically put them on Dropbox, I'm really not ready to fix what must be several hundred links.

So I thought I would post this warning as what will probably be my last post. It's been fun, and I'm grateful to my readers for having taken an interest. We had more than 50,000 unique visitors and over 150 flags. I just wish someone from Tonga had come by.

If you really must have an old file, I can probably find it.

[ UPDATE: I made a zip of the directory tree which was used to be on iDisk under
http://web.mac.com/telliott99/Python/

It is on Dropbox here and weighs in at 17.8 MB. If you check the old links that don't work, the file should be in that archive. Let me know if you can't find something you'd like.

Tuesday, July 31, 2012

More fun with geometry

I ran into a fun problem with parallelograms the other day.

The graphic shows a parallelogram broken up into pieces by its diagonals.

Since the opposing sides of a parallelogram are parallel and equal, it's easy to show that the top and bottom triangles are congruent, just rotated, likewise with the other two.

The problem posed was to express the total area in terms of the obtuse angle at the center, the one which is > 90 degrees. Let's call that angle A, and its supplementary angle (the acute angle at the center) B.

The reason why I've drawn the figure in this way is that, as shown below, the triangles can be rearranged to give a different paralellogram with the same area.

x = length of long diagonal (orange + maroon)
y = length of short diagonal
y/2 = 1/2 length of short diagonal (blue)

Area = x (y/2) sin B

The sine values for supplementary angles are equal.

sin (π/2 - θ) = sin θ (ref)

So:

Area = x (y/2) sin A

Nice! Much better than using the law of sines, and for the vectorists out there, expressed even more compactly as the cross-product.

Saturday, June 16, 2012

variadic arguments

I got curious about functions with variadic arguments like in NSArray:

+ (id)arrayWithObjects:(id)firstObj, ...

Here's are two C examples. In either case, it's required to have a named argument that precedes the variable part. In the first method, that variable gives a count of the arguments. In the second, it's not used for anything except to show where to start. A sentinel value shows the end of the variable length argument list. I also incorporated __func__, which I came across in my reading and was new to me.

// clang variadic.c -o prog
#include <stdio.h>
#include <stdarg.h>

int f(int count, ...)
{
  printf("%s%s", __func__, "\n");
  int n, sum; 
  va_list ap;
  va_start(ap, count);
  sum = 0;
  for ( ; count > 0; count--) {
      n = va_arg(ap, int);
      printf("n = %3d\n", n);
      sum += n;
  }
  va_end(ap);
  return sum;
}

int g(int x, ...)
{
  printf("%s%s", __func__, "\n");
  fflush(stdout);
  int n, sum; 
  va_list ap;
  va_start(ap, x);
  n = x;
  sum = 0;
  while (!(0)) {
      printf("n = %3d\n", n);
      sum += n;
      n = va_arg(ap, int);
      if (n == 0) { break; }
  }
  va_end(ap);
  return sum;
}

int main(int argc, char * argv[]) {
    int result = f(3,1,2,3);
    printf ("sum = %d\n", result);
    result = g(1,2,3,0);
    printf ("sum = %d\n", result);
    return 0;
}

> ./prog
f
n =   1
n =   2
n =   3
sum = 6
g
n =   1
n =   2
n =   3
sum = 6

NSString (1)

Test harness:

// clang strings1.m -o prog -framework Foundation -fobjc-gc-only
#include <Foundation/Foundation.h>

int main(int argc, char * argv[]) {

    // ..code here..

    return 0;
}

NSString *s;
    NSMutableArray *ma = [NSMutableArray arrayWithCapacity:10];

    char *cstr = "abc";   // real C string w/ "\0" also works
    uint ascii = NSASCIIStringEncoding;
    uint utf8 = NSUTF8StringEncoding;
    [ma addObject:[NSString stringWithCString:cstr encoding:ascii]];
    [ma addObject:[NSString stringWithCString:cstr encoding:utf8]];
    [ma addObject:[NSString stringWithUTF8String:cstr]];
    NSMutableString *ms = [[NSMutableString alloc] init];
    [ms appendString:@"abde"];
    [ms insertString:@"c" atIndex:2];
    [ma addObject:[ms copy]];  // otherwise code below will alter
    [ms appendString:@"fghij"];
    [ms replaceOccurrencesOfString:@"B"
                        withString:@"*"
                           options:NSCaseInsensitiveSearch
                             range:NSMakeRange(0,2) ];
    [ma addObject:ms];
    for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); }
    [ma removeAllObjects];
    printf("-----------------\n");

> ./prog
2012-05-29 16:45:03.624 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.626 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.627 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.627 prog[444:707] abcde __NSCFString
2012-05-29 16:45:03.628 prog[444:707] a*cdefghij __NSCFString
-----------------

[ma addObject:[NSString stringWithFormat:@"abcd"]];
    [ma addObject:[NSString stringWithFormat:@"ab%@", @"cd" @"ef"]];
    [ma addObject:[NSString stringWithFormat:@"a%C%C", 0x62, 0x63]];
    char b[] = { 0x61, 0x62, 0x63, 0x64, 0x65 };
    NSData *d = [NSData dataWithBytes:b length:5];
    s = [[NSString alloc] initWithData:d encoding:utf8];
    [ma addObject:s];
    for (id obj in ma) { NSLog(@"%@ %@", obj, [obj class]); }
    printf("-----------------\n");

2012-05-29 16:45:03.629 prog[444:707] abcd __NSCFString
2012-05-29 16:45:03.629 prog[444:707] abcdef __NSCFString
2012-05-29 16:45:03.630 prog[444:707] abc __NSCFString
2012-05-29 16:45:03.630 prog[444:707] abcde __NSCFString
-----------------

d = [NSData dataWithBytes:b length:5];
    [d writeToFile:@"x" atomically:YES];
    NSError *err = nil;
    NSStringEncoding enc = NSUTF8StringEncoding;
    s = [NSString stringWithContentsOfFile:@"x" 
                              usedEncoding:&enc
                                     error:&err];
    if (err != nil) { NSLog(@"%@", [err userInfo]); }
    else {  NSLog(@"%@", s );  }

    b[4] = 0xff;
    d = [NSData dataWithBytes:b length:5];
    [d writeToFile:@"x" atomically:YES];
    s = [NSString stringWithContentsOfFile:@"x" 
                              usedEncoding:&enc 
                                     error:&err];
    if (err != nil) { NSLog(@"%@", [err localizedDescription]); } 
    else {  NSLog(@"%@", s );  }

2012-05-29 16:45:03.641 prog[444:707] abcde
2012-05-29 16:45:03.658 prog[444:707] The file “x” couldn’t be opened because the text encoding of its contents can’t be determined.

Thursday, May 31, 2012

Burroughs-Wheeler Transform-BWT (3)

I'm exploring the Burroughs-Wheeler Transform, which has become important for fast methods of matching short reads to large genomes. The first two posts are here and here.

In the first we saw what BWT is and showed how to reverse or invert it. In the second, we saw how to use BWT to do a search. Here, I show a fairly efficient way to generate the BWT for a sequence.

Last time, we used a script which does a rotation on a deque followed by sorting to generate the BWT of a sequence:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

and obtained

GACGAAGGGATTCTGTG^GATACTTAAACTACC

The result matches what the visualization tool on this page generates. However, our script does a lot more work (and uses more memory) than is really necessary. If we look again at the first few lines of output from bwt.py:

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC

Each number is the index into the sequence where the row (the suffix) begins (0-based indexing). The array of numbers from top to bottom of the output is the suffix array.

As you can see from the last part of the script below, the BWT is easily generated from the suffix array, it's just text[j-1] for each j in the array. That's because the BWT is the last column and the sequences wrap, so each character in the BWT is -1 with respect to the character in the first column for any row.

The Python script shown here sorts the rows but only looks at as many of the characters as are needed to do the sort. We have a list L to contain the growing suffix array. This consists of indexes j into the text, one corresponding to each row, which are in sorted order in the suffix array.

We move an index i along the text and for the suffix starting at each i, find the place where it belongs in the suffix array, then insert it.

I'm sure there are even faster ways to generate the BWT for a text. The most glaring problem with this attempt is that repeated insertion into a list is expensive. We need a data structure that combines ease of insertion with ease of traversal (like a linked list).

Output:

> python fast_bwt3.py 
GACGAAGGGATTCTGTGGATACTTAAACTACC 

 20  21   6   1  22   7  14  24  29   2
  5  16  17  26   9  31  19  13  23  28
  8  18  12   4  15  25  30  27  11   3
 10  32

If you compare this with what we got last time you'll see it matches.

fast_bwt.py

#text = 'BANANA'
text = 'GAATTCAAGCTTGGATCCGGAAAGATCTGATC' 
t = text + '^'
N = len(t)

# L is the suffix array
L = list()

# i: index to walk current suffix along text
# j: indexes for sorted suffix already in suffix array
# k: enumeration of L, to give insertion point

def doOne(i,L):
    for k,j in enumerate(L):
        n = 0
        while t[i+n] == t[j+n]:
            n += 1
        if t[i+n] < t[j+n]:
            L.insert(k,i)
            return
    L.append(i)

for i in range(1,N):
    doOne(i,L)

bwt = ''.join([t[j-1] for j in L])
print bwt, '\n'
for k,j in enumerate(L):
    if k and not k % 10:
        print
    print '%3d' % j,

Burroughs-Wheeler Transform-BWT (2)

I'm exploring the Burroughs-Wheeler Transform, which has become very important for fast methods of matching short reads to large genomes. The first post is here.

I found a nice page about it here, which is to be part of a series. There is a great applet(?), anyhow a tool for visualization is on that page, that can walk you through the steps of the search. I found it very helpful for understanding what's going on. I hope my explanation is also useful.

With the output from the Python script from yesterday, we'll try using BWT to search for the query GGATC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC

We'll work from the end of the query backwords. The rows we're working on can be found in the output from the script, which is listed at the end of this post.

Q1 = C.
Begin by selecting all the rows that start with C:

 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT

Q2 = T
So next, restricted to this selection,
find all the positions in the BWT (last column) that contain T.
The first one is

5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT

Within the BWT as a whole, this is the first T.
There are four T's in the BWT in the range selected.
So our range is now 1-4.

Next, go to the rows starting with T,
that are of rank 1-4 with respect to T:

 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA

Q3 = A.
Find all the rows with A in the BWT for this range.
The first one is:

15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA

Within the BWT as a whole, this is the seventh A.
There are three A's in the BWT in the range selected.
So our range is now 7-9.

Next, go to the rows starting with A,
that are of rank 7-9 with respect to A:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG

Q4 = G.

Find all the rows with G in the BWT for this range.
That would be all of them.

The first one is:

14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG

Within the BWT as a whole, this is the 3rd G.
So our range is now 3-5.

Finally, go to the rows starting with G,
that are of rank 3-5 with respect to G:

13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT

Q5 is G.
There is one row with a match.
Since is the last letter of the query, we're done.

Our match has index 13.

That's the position of the query GATCC in the text:

GAATTCAAGCTTGGATCCGGAAAGATCTGATC
             +

using zero-based indexing.

Notice that we don't use the whole table, and we don't use the suffix array (the indexes) until the end. Instead, all we use is the BWT and the sorted characters (actually, the index of the first A, C, etc.):

txt GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

One more thing. If we have a selected range in the BWT and are finding say, all the G's from index 6-10, we need to find quickly how many G's come before that position in the BWT.

Not using the whole table is important, because the BWT has to be memory-efficient. It would not be useful if we required memory proportional to the square of the genome size.

I think it might be worth it to work the whole example again, using just the BWT and the sorted characters:

[ NOTE: Blogger keeps screwing with the formatting. Please let me know if it doesn't seems correct later on. ]

BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^

Q1 = C
Q2 = T
          +
GACGAAGGGATTCTGTG^GATACTTAAACTACC
AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ------

The indicated T is T1 in the BWT
R = 1-4

Go to T1-T4

Q3 = A

     +  ++   +         + +   +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                            ----

The indicated A is A7 in the BWT
R = 7-9

Go to A7-A9

Q4 = G

    +  +  +
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
          ---

The indicated G is G3 in the BWT
R = 3-5

Go to G3-G5

Q5 = G

    +  +  +++     + + +   
BWT GACGAAGGGATTCTGTG^GATACTTAAACTACC
chr AAAAAAAAAACCCCCCGGGGGGGGTTTTTTTT^
                      ---

The indicated G is G8 in the BWT
We're done.

The index (see below) is 13 for the match.

output from bwt.py

seq
   GAATTCAAGCTTGGATCCGGAAAGATCTGATC^

20 AAAGATCTGATC^GAATTCAAGCTTGGATCCGG
21 AAGATCTGATC^GAATTCAAGCTTGGATCCGGA
 6 AAGCTTGGATCCGGAAAGATCTGATC^GAATTC
 1 AATTCAAGCTTGGATCCGGAAAGATCTGATC^G
22 AGATCTGATC^GAATTCAAGCTTGGATCCGGAA
 7 AGCTTGGATCCGGAAAGATCTGATC^GAATTCA
14 ATCCGGAAAGATCTGATC^GAATTCAAGCTTGG
24 ATCTGATC^GAATTCAAGCTTGGATCCGGAAAG
29 ATC^GAATTCAAGCTTGGATCCGGAAAGATCTG
 2 ATTCAAGCTTGGATCCGGAAAGATCTGATC^GA
 5 CAAGCTTGGATCCGGAAAGATCTGATC^GAATT
16 CCGGAAAGATCTGATC^GAATTCAAGCTTGGAT
17 CGGAAAGATCTGATC^GAATTCAAGCTTGGATC
26 CTGATC^GAATTCAAGCTTGGATCCGGAAAGAT
 9 CTTGGATCCGGAAAGATCTGATC^GAATTCAAG
31 C^GAATTCAAGCTTGGATCCGGAAAGATCTGAT
19 GAAAGATCTGATC^GAATTCAAGCTTGGATCCG
 0 GAATTCAAGCTTGGATCCGGAAAGATCTGATC^
13 GATCCGGAAAGATCTGATC^GAATTCAAGCTTG
23 GATCTGATC^GAATTCAAGCTTGGATCCGGAAA
28 GATC^GAATTCAAGCTTGGATCCGGAAAGATCT
 8 GCTTGGATCCGGAAAGATCTGATC^GAATTCAA
18 GGAAAGATCTGATC^GAATTCAAGCTTGGATCC
12 GGATCCGGAAAGATCTGATC^GAATTCAAGCTT
 4 TCAAGCTTGGATCCGGAAAGATCTGATC^GAAT
15 TCCGGAAAGATCTGATC^GAATTCAAGCTTGGA
25 TCTGATC^GAATTCAAGCTTGGATCCGGAAAGA
30 TC^GAATTCAAGCTTGGATCCGGAAAGATCTGA
27 TGATC^GAATTCAAGCTTGGATCCGGAAAGATC
11 TGGATCCGGAAAGATCTGATC^GAATTCAAGCT
 3 TTCAAGCTTGGATCCGGAAAGATCTGATC^GAA
10 TTGGATCCGGAAAGATCTGATC^GAATTCAAGC
32 ^GAATTCAAGCTTGGATCCGGAAAGATCTGATC

Wednesday, May 30, 2012

Burroughs-Wheeler Transform-BWT (1)

Some time ago I was introduced to bowtie (see this post).

As the front page says:

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

So the next question is: what is the Burroughs-Wheeler Transform (BWT) and how does it speed up this process? It turns out that the discovery of BWT was not all that long ago (1994 according to wikipedia), and its application to this bioinformatics problem is more recent than that. BWT has been applied to the problem of data compression in the well known bzip tool. We'll talk about the compression part today, and deal with search in another post.

I wrote a simple Python script to explore the BWT; the listing is at the end of the post. The basic transformation couldn't be easier. Take a short text, like 'BANANA', and add a marker to the end, giving 'BANANA^'.

[UPDATE: I chose '^' because it sorts after the uppercase letters by the default sort. This may grate a little for regex pros who expect '^' to mark the beginning of something. Just define your own cmp function if you are one of these. ]

Generate all the rotational permutations of this text:

BANANA^

^BANANA

A^BANAN

NA^BANA

ANA^BAN

NANA^BA

ANANA^B

Now sort the list:

ANANA^B

ANA^BAN

A^BANAN

BANANA^

NANA^BA

NA^BANA

^BANANA

That's it! The last column, 'BNN^AAA' is the BWT of 'BANANA'. I think you can see that the runs of identical symbols would allow opportunity for efficient compression.

But here's the main point: by itself, the last column contains all of the information that was in the original text. That is, the transformation is reversible or invertible. The inversion depends on the cyclic rotation.

Here are the steps. Take the BWT and sort it:

A

A

A

B

N

N

^

By the way, that's what we already have in column 1 (or 0, for you Pythonistas).

Because of the rotation, the BWT plus column 1 gives all pairs of symbols in the original text. The symbol pairs are (BWT first, then column 1):

BA

NA

NA

^B

AN

AN

A^

Now iteratively do this: sort whatever we generated in the previous step, and then tack on the BWT in front. Here are the triplets:

BAN

NAN

NA^

^BA

ANA

ANA

A^B

Rinse, lather and repeat:

BANA

NANA

NA^B

^BAN

ANAN

ANA^

A^BA



BANAN

NANA^

NA^BA

^BANA

ANANA

ANA^B

A^BAN



BANANA

NANA^B

NA^BAN

^BANAN

ANANA^

ANA^BA

A^BANA

Although 'BANANA' is a nice simple example to begin with, in some respects it is too simple. For example:

> python bwt.py ACGTGAATTCGAAACCGGAA
11 AAACCGGAA^ACGTGAATTCG
12 AACCGGAA^ACGTGAATTCGA
 5 AATTCGAAACCGGAA^ACGTG
18 AA^ACGTGAATTCGAAACCGG
13 ACCGGAA^ACGTGAATTCGAA
 0 ACGTGAATTCGAAACCGGAA^
 6 ATTCGAAACCGGAA^ACGTGA
19 A^ACGTGAATTCGAAACCGGA
14 CCGGAA^ACGTGAATTCGAAA
 9 CGAAACCGGAA^ACGTGAATT
15 CGGAA^ACGTGAATTCGAAAC
 1 CGTGAATTCGAAACCGGAA^A
10 GAAACCGGAA^ACGTGAATTC
 4 GAATTCGAAACCGGAA^ACGT
17 GAA^ACGTGAATTCGAAACCG
16 GGAA^ACGTGAATTCGAAACC
 2 GTGAATTCGAAACCGGAA^AC
 8 TCGAAACCGGAA^ACGTGAAT
 3 TGAATTCGAAACCGGAA^ACG
 7 TTCGAAACCGGAA^ACGTGAA
20 ^ACGTGAATTCGAAACCGGAA

The BWT (last column) does not group all identical letters together. And in the inversion step, the reconstructed text is not necessarily in the first row.

In the output above, I've also listed the "suffix array", which is the index into the original text where we start each row. The suffix array will become important for the application to search, next time.

bwt.py

import sys
from collections import deque

try: 
    text = sys.argv[1]
except IndexError:
    text = 'BANANA'

text += '^'
n = len(text)
#---------------------------------------------
# BWT

rL = list()
dq = deque(text)
for i in range(n):
    rL.append(''.join(list(dq)))
    dq.rotate(1)
rL.sort()

bwt = [s[-1] for s in rL]
sfx = [(n - e.index('^') - 1) for e in rL]

for s,e in zip(sfx,rL):
    print '%2d' % s,
    print e
#---------------------------------------------
# reverse
# take the last col and sort to generate col 0

c0 = sorted(bwt)
print '\n'.join(c0)

# get pairs as a demo:
c1 = [bwt[i]+c0[i] for i in range(n)]
c1.sort()
print '\n'.join(c1)

# iterate starting with bwt again:
L = sorted(bwt)
for j in range(n-2):
    L.sort()
    L = [bwt[i]+L[i] for i in range(n)]
    print '\n'.join(L) + '\n'
    
L = [e for e in L if not '^' in e]
assert ''.join(L[0]) == text[:-1]

Sunday, May 27, 2012

Blocks (7)

As long as I was thinking about NSAnimation (here), I wrote a command-line program that uses it. This is a minimal example.

In the process, I realized that the callback can be done with a block. I'm not sure what the approved method to do this would be; I looked at the delegate methods and they're not promising. I guess if we were instantiating the animation from an object we'd just pass self as an argument to the alternative init method. In a GUI app we'd use an IBOutlet and never need to do the explicit init in our code.

Output:

> ./prog
2012-05-27 12:08:12.794 prog[7813:707] Progress: 0.20
2012-05-27 12:08:12.994 prog[7813:707] Progress: 0.40
2012-05-27 12:08:13.194 prog[7813:707] Progress: 0.60
2012-05-27 12:08:13.394 prog[7813:707] Progress: 0.80
2012-05-27 12:08:13.595 prog[7813:707] Progress: 1.00

// clang prog.m -o prog -framework Cocoa -fobjc-gc-only
#import <Cocoa/Cocoa.h>

typedef void (^B)(NSAnimationProgress p);
B b = ^(NSAnimationProgress p){ NSLog(@"Progress: %.2f", p); };

@interface MyAnimation : NSAnimation
@end

@implementation MyAnimation

B myBlock;

- (id)init
{
    self = [super initWithDuration:1.0 
                    animationCurve:NSAnimationLinear];
    if (self) {
        [self setFrameRate:5.0];
        [self setAnimationBlockingMode:NSAnimationNonblocking];
    }
    return self;
}

- (id)initWithBlock:(B)b {
    self = [self init];
    myBlock = b;
    return self;
}

- (void)setCurrentProgress:(NSAnimationProgress)progress {
    myBlock(progress);
}

@end

int main(int argc, char * argv[]) {
    NSAnimation *a = [[MyAnimation alloc] initWithBlock:b];
    [a startAnimation];

    NSRunLoop* rl = [NSRunLoop currentRunLoop];
    [rl runUntilDate:[NSDate dateWithTimeIntervalSinceNow:1.2]];
    return 0;
}

Bezier curve

I started on an exploration of the NSBezierPath function curveToPoint:controlPoint1:controlPoint2: which is used to draw curves in Cocoa.

The math is relatively easy to state (wikipedia). For a cubic curve, we'll have a total of four points, two where the curve starts and stops (P0 and P3) and two control points (P1 and P2). In the screenshots above, P0 and P3 are in black at the top and bottom and the two control points are in red. The blue circle shows the current point along the trajectory. The left one is at an early time, and the right one very late.

[ The purple circles show the current position for the two quadratic Bezier curves from which the cubic one may be constructed. ]

The curve is represented as a parametric equation in the variable t (standing for time, since curves are often thought of as trajectories).

A standard equation is:

P0(1-t)³ + P1(3)(t)(1-t)² + P2(3)(t)²(1-t) + P3(t)³

At t = 0 we're at P0 and at t = 1 we're at P3. Beyond that, it's hard to visualize. According to wikipedia:

The curve starts at P0 going toward P1 and arrives at P3 coming from the direction of P2. Usually, it will not pass through P1 or P2; these points are only there to provide directional information. The distance between P0 and P1 determines "how long" the curve moves into direction P2 before turning towards P3.

Writing B_Pi,Pj,Pk(t) for the quadratic Bézier curve defined by points Pi, Pj, and Pk, the cubic Bézier curve can be defined as a linear combination of two quadratic Bézier curves:

B(t) = (1-t)B_P0,P1,P2 + (t)B_{P1,P2, P3}; 0 <= t <= 1

where

B_P0,P1,P2 = P0(1-t)² + P1(t)(1-t) + P2(t)²

As I say, I found it hard to visualize. So I wrote a little Xcode project that does two things. It presents an animation of the progress of the trajectory. At the moment it can be stopped and restarted, but not paused.

Also, the two control points P1 and P2 can be dragged to different positions to explore how they influence the curve.

I took the dragging and animation code from two earlier PyObjC projects (here and here).

My hope was to reproduce the animations shown in the wikipedia article, and see how the curves combine to produce the result. But I've run out of time for the moment, so I thought I'd just show my progress to date. The zipped project is on Dropbox (here).

View with no window

Scott Stevenson has two great posts about tutorials on drawing in Quartz 2D (here and here). I used his code to draw the graphic above.

One issue I've had was whether it's really necessary to use Core Graphics to draw to PDF from a command-line app. It's not, as this code shows. Thanks to a Stack Overflow answer for help.

// clang prog.m -o prog -framework Cocoa -fobjc-gc-only
#import <Cocoa/Cocoa.h>
//#import <AppKit/AppKit.h>

@interface MyView : NSView
@end

@implementation MyView
- (id)initWithFrame:(NSRect)frame
{
    self = [super initWithFrame:frame];
    if (self) {
        // Initialization code here.
    }   
    return self;
}

- (void)drawRect:(NSRect)dirtyRect
{

// just paste Scott Stevenson's code here    
    
}
@end

int main(int argc, char * argv[]) {
    MyView *myView;
    NSRect f = NSMakeRect(0,0,400,400);
    myView = [[MyView alloc] initWithFrame:f];
    NSData *data = [myView dataWithPDFInsideRect:f];
    [data writeToFile:@"x.pdf" atomically:YES];
    return 0;
}

Friday, May 25, 2012

Quartz (2)

This second Quartz example is really more about arrays than drawing. I made a simple category on NSArray that returns a random object from any array. We set up three arrays from which to grab random values for the color, the origin and the size of a rectangle to draw. Both the origins and sizes need to be wrapped into NSValue objects in order to store them into the arrays. The original structs can then be retrieved by reference.

In a similar way, the rgb and alpha values are recovered from the NSColor to provide them to CGContextSetRGBFillColor. In playing around with the code I decided to bias toward higher values of red. We use a venerable C construct for that.

// clang x.m -o prog -framework Cocoa -fobjc-gc-only
#import <Cocoa/Cocoa.h>

CGContextRef setup(){
    NSString *s = [NSString stringWithUTF8String:"x.pdf"];
    NSURL *url = [NSURL fileURLWithPath:s];
    CGContextRef ctx = NULL;
    ctx = CGPDFContextCreateWithURL (
                            (CFURLRef)url,
                            NULL,
                            NULL);

    CGRect rect = CGRectMake (0, 0, 500, 400);              
    CGContextBeginPage(ctx, &rect);
    return ctx;
}

@interface NSArray (RandomChoiceArray)
- (id)getRandomObject;
@end;

@implementation NSArray (RandomChoiceArray)
- (id)getRandomObject {
    int n = [self count];
    int i = (random() % n);
    return [self objectAtIndex:i];
}
- (void)testRandom:(int)n {
    for (int i = 0; i < n; i++) {
        NSLog(@"%@", [[self getRandomObject] description]);
    }
}
@end

int main(int argc, char * argv[]) {
    srandom(time(NULL));
    CGContextRef ctx = setup();
    
    NSColorList *CL;
    CL = [NSColorList colorListNamed:@"Crayons"];
    NSArray *colorNames = [CL allKeys];
    //[colorNames testRandom:10];
    
    NSSize hz, vt;
    hz = NSMakeSize(150,50);
    vt = NSMakeSize(50,150);
    NSValue *v, *h;
    h = [NSValue value:&hz withObjCType:@encode(NSSize)];
    v = [NSValue value:&vt withObjCType:@encode(NSSize)];
    NSArray *sizes = [NSArray arrayWithObjects:h, v, nil];
    //[sizes testRandom:4];
    
    NSMutableArray *ma;
    ma = [NSMutableArray arrayWithCapacity:10];
    for (int i = 0; i < 9; i++) {
        for (int j = 0; j < 9; j++) {
            int x = i*50;
            int y = j*50;
            NSPoint pt = NSMakePoint(x,y);
            NSValue *p;
            p = [NSValue value:&pt withObjCType:@encode(NSPoint)];
            [ma addObject:p];
        }
    }
    NSArray *origins = [NSArray arrayWithArray:ma];
    //[origins testRandom:10];
    
    CGFloat r, g, b, a, m;
    m = 0.7;
    for (int i = 0; i < 250; i++) {
        NSString *c = [colorNames getRandomObject];
        NSColor *color = [CL colorWithKey:c];
        id obj = [origins getRandomObject];
        NSPoint pt;
        [obj getValue:&pt];
        NSSize sz;
        obj = [sizes getRandomObject];
        [obj getValue:&sz];
        CGRect rect = CGRectMake(pt.x,pt.y,sz.width,sz.height); 
        //NSLog (@"%@", NSStringFromRect(r));
        
        [color getRed:&r green:&g blue:&b alpha:&a];
        r = r < m ? m : r;
        CGContextSetRGBFillColor (ctx,r,g,b,a);
        CGContextFillRect (ctx,rect);
    }
    CGPDFContextEndPage(ctx);
    CGContextRelease(ctx);
    return 0;
}

Quartz (1)

The next examples I want to do will be about drawing to a pdf in OS X. Currently, I've been compiling code and executing it in Terminal, to try to abstract the essentials away from all the heavy machinery of Xcode.

If we were using Xcode, and had an app with a GUI, then this could be as simple as calling a method of NSView:

NSRect r = [myView bounds];

NSData *data = [myView dataWithPDFInsideRect:r];

[data writeToFile:[sheet filename] atomically:YES];

But I wanted to write data to a pdf from a command line application. That led me to an example that used:

CGPDFContextCreateWithURL

as well as a family of functions to create a CFURLRef:

CFURLCreateWithString

CFURLCreateWithFileSystemPath

These are functions, not methods of Objective-C objects. We're really in C-land here. The CF stands for Core Foundation. One aspect of this that I don't understand is that the docs talk about e.g. a CFURL and CFString, but the actual types are CFURLRef and CFStringRef. I'm not sure why yet.

CFURLCreateWithFileSystemPath takes 4 arguments:

CFURLRef CFURLCreateWithFileSystemPath (

CFAllocatorRef allocator,

CFStringRef    filePath,

CFURLPathStyle pathStyle,

Boolean        isDirectory

);

If you want to be fancy, you can pass kCFAllocatorDefault for the allocator, but this is really equivalent to NULL, and just tells the system to use the default allocator. The second argument is a CFStringRef. See the code example below for how to make one of these. The third argument should be kCFURLPOSIXPathStyle (other options would be kCFURLHFSPathStyle and kCFURLWindowsPathStyle). And the last argumenet is false, because this is not for a directory.

Since the function has "Create" in the name, we "own" the resulting object, and should release it when we don't need it anymore.

However, it turns out to be much simpler than this (although the above works). NSString and NSURL are "toll-free bridged" with CFStringRef and CFURLRef (docs), and can be substituted one for another in any function call.

It took me far too long to discover the secret of this (as the link above shows), you can substitute, but you must also cast, as in (CFURLRef)url.

Finally, having a CFURLRef, we can get the graphics context appropriate for drawing to a pdf. We bracket the drawing code with calls to CGContextBeginPage and CGContextEndPage. The drawing itself is the Apple example and self-explanatory, except that it's important that the transparency (alpha) for the second rectangle is 0.5. That allows us to see some color come through from the layer below it.

I puzzled over the fact that the right side of the figure isn't blue but more like purple. I finally realized that what's happening is the transparency works here too (and for some reason the background is white). The implication is that the order matters, if we draw the red rectangle last (with alpha = 0.5) the result is different, as shown.

An excellent beginner's tutorial for Quartz drawing is here.

// clang prog.m -o prog -framework Cocoa -fobjc-gc-only
#import <Cocoa/Cocoa.h>

int main(int argc, char * argv[]) {
    // method 1
    CFStringRef cfstr = CFSTR("x.pdf");
    CFURLRef urlref = CFURLCreateWithFileSystemPath (
                                    kCFAllocatorDefault,
                                    cfstr,
                                    kCFURLPOSIXPathStyle,
                                    false);
    if (urlref == NULL) { 
        NSLog(@"some problem with the url");
        return 0; 
    }
      
    // method 2
    NSString *s = [NSString stringWithUTF8String:"x.pdf"];
    NSURL *url = [NSURL fileURLWithPath:s];
    urlref = (CFURLRef)url;
            
    CGContextRef myPDFContext = NULL;
    myPDFContext = CGPDFContextCreateWithURL (
                            urlref,
                            NULL,
                            NULL);
    
    CGRect rect = CGRectMake (0, 0, 500, 250);              
    CGContextBeginPage(myPDFContext, &rect);

    CGRect r1, r2, r3, r4;
    r1 = CGRectMake (21,21,100,200);
    r2 = CGRectMake (21,21,200,100);
    
    CGContextSetRGBFillColor (myPDFContext,1,0,0,1);
    CGContextFillRect (myPDFContext, r1);

    CGContextSetRGBFillColor (myPDFContext,0,0,1,0.5);
    CGContextFillRect (myPDFContext, r2);

    r3 = CGRectMake (221,21,100,200);
    r4 = CGRectMake (221,21,200,100);

    CGContextSetRGBFillColor (myPDFContext,0,0,1,1);
    CGContextFillRect (myPDFContext,r4);

    CGContextSetRGBFillColor (myPDFContext,1,0,0,0.5);
    CGContextFillRect (myPDFContext, r3);

    CGContextEndPage(myPDFContext);
    
    //CFRelease(urlref);
    CGContextRelease(myPDFContext);
    return 0;
}

Blocks (6)

I decided to work up two more examples from Mike Ash's Friday Q&A about blocks.

Both examples involve a callback. Your program sets up a situation where you'd like to defer the execution of some code to a later time. The classic example is a file dialog. You're asking the user for input, but don't necessarily halt the program until he makes up his mind. Another one is where you have a bunch of calculations that could be done concurrently. I still have to work up a good example of that.

Here we have a notification in the first example, and a timer in the second. The implementations use a category on NSObject. This works because a block is an Objective-C object. An alternative approach would be to use a function rather than a method.

The examples may look a bit strange. The block object is added as the observer, and its method my_callBlockWithObject or my_CallBlock is called when the notification appears or the timer finishes. It looks strange because in the method, the block object is obtained with self and then called.

Notice the use of [block copy].

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#import 

typedef void (^B)(NSNotification *note);
B b = ^(NSNotification *note){ NSLog(@"Note!"); };

@implementation NSObject (BlocksAdditions)

- (void)my_callBlockWithObject:(id)arg {
    B b = (id)self;
    b(arg);
}
@end

@implementation NSNotificationCenter (BlocksAdditions)
    
- (void)addObserverForName: (NSString *)name 
                    object: (id)obj
                     block: (B)b {
                         
    [self addObserver: [b copy] 
             selector: @selector(my_callBlockWithObject:)
                 name: name 
               object: obj];
}
@end

int main(int argc, char * argv[]) {
    NSNotificationCenter *cent = [NSNotificationCenter defaultCenter];
    [cent addObserverForName: @"MyNotification"
                      object: nil
                       block: b];
                                                       
    [cent postNotificationName:@"MyNotification" object:nil];
    return 0;
}

> ./prog2012-05-21 07:40:28.601 prog[3441:707] Note!

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#import 

typedef void (^B)();

@implementation NSObject (BlocksAddition)
- (void) my_CallBlock {
    B b = (id)self;
    b();
}
@end

void RunAfterDelay(NSTimeInterval delay, B b) {
    [[b copy] performSelector:@selector(my_CallBlock)
                   withObject:nil
                   afterDelay:delay ];
}

int main(int argc, char * argv[]) {
    NSString *s = @"world!";
    RunAfterDelay(0, ^{ NSLog(@"Hello %@", s);  });
    
    NSRunLoop* rl = [NSRunLoop currentRunLoop];
    [rl runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.5]];
    return 0;
}

> ./prog
2012-05-21 07:41:00.457 prog[3455:707] Hello world!

Saturday, May 19, 2012

Blocks (5)

Here is a last example of basic blocks in Objective-C. It comes from Mike Ash's excellent Friday Q&A about blocks. We implement the functional programming tool map as a category on NSArray.

The map method takes a block as its single argument. This is the fundamental unit of work, which in the example will operate on one element in an array of NSStrings, adding @"x" to it. The block argument to map is declared with an id argument and an id return value; these will be NSStrings when the block is defined at runtime. A different block could be fed to the same map function, and the elements wouldn't have to be strings.

map applies the block to each element of the array, storing the results in an NSMutableArray and returning it.

We use enumerateObjectsUsingBlock: to examine the values, although we could just use for (id obj in array).

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#import <Foundation/Foundation.h>

@interface NSArray (MappingArray)
- (NSArray *)map:(id (^)(id))block;
@end;

@implementation NSArray (MappingArray)
- (NSArray *)map:(id (^)(id))block { // takes an id, returns an id
        NSMutableArray *ret = [NSMutableArray array];
        for(id obj in self)
            [ret addObject:block(obj)];
        return ret;
    }
@end

int main(int argc, char * argv[]) {
    NSArray *a, *b;
    a = [NSArray arrayWithObjects:@"a",@"b",@"c",nil];
    b = [a map:^(id obj){ return [obj stringByAppendingString:@"x"]; }];
    
    [b enumerateObjectsUsingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
        NSLog(@"%@", obj);
        } ];
    
    return 0;
}

> ./prog
2012-05-19 10:00:13.537 prog[3426:707] ax
2012-05-19 10:00:13.539 prog[3426:707] bx
2012-05-19 10:00:13.540 prog[3426:707] cx

Friday, May 18, 2012

Blocks (4)

In the spirit of quick Objective-C, here is another example using blocks. We feed a bunch of blocks to NSOperationQueue. According to the docs

The block should take no parameters and have no return value.

So then the question is what can the block do? In this example, we get values from a loop counter and also call a function defined in the same file.

A run loop is required in order to see any output (and the output is truncated if the run loop's time interval is shortened).

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#import <Foundation/Foundation.h>

void f() { printf("."); }
typedef void (^B)();

int main(int argc, char * argv[]) {
    NSOperationQueue *q = [NSOperationQueue mainQueue];
    int N = 100;
    __block int c = 0;
    NSLog(@"%p", &f);
    NSLog(@"%p", &main);
    NSLog(@"%p", &c);
    B b = ^() { 
        f();
        c++;
        if (c && !(c%(N/10))) {
            printf("c = %3d\n", c); 
            }
        };
    NSLog(@"%p", &b);
    B b2 = [b copy];
    NSLog(@"%p", &b2);

    for (int i = 0; i < N; i++) {
        [q addOperation:[NSBlockOperation blockOperationWithBlock:b]];
    }
    NSRunLoop* rl = [NSRunLoop currentRunLoop];
    NSLog(@"%p", &rl);
    [rl runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.2]];
    return 0;
}

Output:

> ./prog
2012-05-19 08:53:26.881 prog[2988:707] 0x103da99c0
2012-05-19 08:53:26.883 prog[2988:707] 0x103da99e0
2012-05-19 08:53:26.883 prog[2988:707] 0x7fff639a7ba8
2012-05-19 08:53:26.884 prog[2988:707] 0x7fff639a7b88
2012-05-19 08:53:26.884 prog[2988:707] 0x7fff639a7b88
2012-05-19 08:53:26.885 prog[2988:707] 0x7fff639a7b50
2012-05-19 08:53:26.887 prog[2988:707] 0x7fff639a7b40
..........c =  10
..........c =  20
..........c =  30
..........c =  40
..........c =  50
..........c =  60
..........c =  70
..........c =  80
..........c =  90
..........c = 100

From what I understand, I expected the first block to be in a different place than the copy, but they both appear to be on the heap. The first two variables, main and f, are on the stack.

This isn't a particularly useful example. I have to look into things some more to find a good toy example for concurrent execution.

Blocks (3)

Here is one more simple use of blocks in Objective-C (previous posts here and here).

The docs for NSArray give these three functions using blocks:

enumerateObjectsUsingBlock:

enumerateObjectsWithOptions:usingBlock:

enumerateObjectsAtIndexes:options:usingBlock:

As the reference indicates, these methods require a particular form of block with this signature:

^(id s, NSUInteger i, BOOL *stop)

(Apple uses idx for i). The stop flag is meant to tbe used to bail from an iteration when the first matching object is found. The index is not a loop counter but shows the index of the current object within the array.

// clang blocks2.m -o prog -framework Foundation -fobjc-gc-only
#import <Foundation/Foundation.h>

typedef void (^B) (id s, NSUInteger idx, BOOL *stop);

int main(int argc, char * argv[]){
    B b = ^(id s, NSUInteger i, BOOL *stop) { 
        NSLog(@"i = %lu, pre : s = %@", i, s);
        if ([s caseInsensitiveCompare:@"B"] == 0) {
            NSLog(@"setting stop, s = %@", s);
            *stop = YES;
        }
        NSLog(@"post: s = %@", s);
    };
    
    NSArray *arr = [NSArray arrayWithObjects:@"a",@"B",@"c",@"d",nil];
    [arr enumerateObjectsUsingBlock:b];
    printf("------------------------------\n");
    //NSUint o = NSEnumerationReverse;
    uint o = NSEnumerationReverse;
    [arr enumerateObjectsWithOptions:o
                          usingBlock:b];
    printf("------------------------------\n");
    NSIndexSet *is;
    NSRange R = NSMakeRange(2,2);
    is = [NSIndexSet indexSetWithIndexesInRange:R];
    [arr enumerateObjectsAtIndexes:is
                           options:NO
                        usingBlock:b];
    return 0;
}

> ./prog
2012-05-18 09:02:07.791 prog[382:707] i = 0, pre : s = a
2012-05-18 09:02:07.793 prog[382:707] post: s = a
2012-05-18 09:02:07.794 prog[382:707] i = 1, pre : s = B
2012-05-18 09:02:07.795 prog[382:707] setting stop, s = B
2012-05-18 09:02:07.795 prog[382:707] post: s = B
------------------------------
2012-05-18 09:02:07.796 prog[382:707] i = 3, pre : s = d
2012-05-18 09:02:07.796 prog[382:707] post: s = d
2012-05-18 09:02:07.797 prog[382:707] i = 2, pre : s = c
2012-05-18 09:02:07.797 prog[382:707] post: s = c
2012-05-18 09:02:07.798 prog[382:707] i = 1, pre : s = B
2012-05-18 09:02:07.798 prog[382:707] setting stop, s = B
2012-05-18 09:02:07.799 prog[382:707] post: s = B
------------------------------
2012-05-18 09:02:07.799 prog[382:707] i = 2, pre : s = c
2012-05-18 09:02:07.800 prog[382:707] post: s = c
2012-05-18 09:02:07.800 prog[382:707] i = 3, pre : s = d
2012-05-18 09:02:07.801 prog[382:707] post: s = d

Another NSArray method that uses a block is sortedArrayUsingComparator. Here's an example adapted from the Apple docs, followed by a slightly modified approach.

// clang blocks3.m -o prog -framework Foundation -fobjc-gc-only

#import <Foundation/Foundation.h>

void show(NSArray *a) {
    for (id obj in a) { 
        NSLog(@"%@", [obj description]);
        }   
}

int main(int argc, char * argv[]) {
    NSArray *a;
    a = [NSArray arrayWithObjects:@"1",@"21",@"12",@"11",@"02",nil];
    NSStringCompareOptions o1, o2, o3, o4;
    o1 = NSCaseInsensitiveSearch;
    o2 = NSNumericSearch;
    o3 = NSWidthInsensitiveSearch;
    o4 = NSForcedOrderingSearch;
    NSStringCompareOptions o = o2;
    NSLocale *loc = [NSLocale currentLocale];
    NSComparator cmp = ^(id s1, id s2) {
        NSRange R = NSMakeRange(0, [s1 length]);
        return [s1 compare:s2 options:o range:R locale:loc];
    };
    NSArray *result = [a sortedArrayUsingComparator:cmp];
    show(result);
    
    NSComparator cmp2 = ^(id s1, id s2) {
        NSNumber *x = [NSNumber numberWithInt:[s1 intValue]];
        NSNumber *y = [NSNumber numberWithInt:[s2 intValue]];
        return (NSComparisonResult)[x compare:y];
    };
    result = [a sortedArrayUsingComparator:cmp2];
    show(result);
    return 0;
}

Both versions give the same output:

2012-05-18 13:31:41.071 prog[902:707] 1
2012-05-18 13:31:41.073 prog[902:707] 02
2012-05-18 13:31:41.074 prog[902:707] 11
2012-05-18 13:31:41.074 prog[902:707] 12
2012-05-18 13:31:41.075 prog[902:707] 21

One more NSArray method that uses a block is indexesOfObjectsPassingTest. This example is also adapted from the Apple docs. It's followed by use of a block and enumerateIndexesUsingBlock to enumerate the resulting NSIndexSet.

// clang blocks5.m -o prog -framework Foundation -fobjc-gc-only

#import <Foundation/Foundation.h>

int main(int argc, char * argv[]) {
    NSString *s = [NSString stringWithFormat:@"A B C A B Z G Q"];
    NSArray *a = [s componentsSeparatedByString:@" "];
    NSSet *filterSet = [NSSet setWithObjects:@"A", @"C", @"Q", nil];
    
    BOOL (^test) (id obj, NSUInteger idx, BOOL *stop);
    test = ^ (id obj, NSUInteger idx, BOOL *stop) {
        if (idx < 5) {
            if ([filterSet containsObject:obj]) {
                return YES;
            }
        }
        return NO;
    };
    NSIndexSet *iset = [a indexesOfObjectsPassingTest:test];
    [iset enumerateIndexesUsingBlock:^(NSUInteger idx, BOOL *stop) {
        NSLog(@"%lu", idx);
        } ];
    return 0;
}

Output:

> ./prog
2012-05-18 17:03:26.348 prog[1477:707] 0
2012-05-18 17:03:26.351 prog[1477:707] 2
2012-05-18 17:03:26.351 prog[1477:707] 3

Thursday, May 17, 2012

Blocks (2)

Continuing from the previous post, here we see the ability of a block to access (and modify) variables in the enclosing scope. The first type of block takes an int, multiplies it by a second int defined in main but not passed as an argument, and returns the result. It also accesses the variable c (a char *) in main, and changes its value. The change happens before the second block runs as shown by the result of strcmp, which returns 0 for a match, resulting in the assignment of c to a third value.

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#import <Foundation/Foundation.h>

typedef int (^A) (int x);
typedef void (^B) ();

int main(int argc, char * argv[]){
    int y = 7;
    __block char *c = "abc\0";
    printf("initial: %s\n", c);
    const char *c2 = "def\0";
    
    A a = ^(int x) { 
        c = (char *) c2;
        return x*y;
    };
    
    B b = ^() { 
        if (!(strcmp(c,c2))) {
            c = "ghi\0"; 
        }
    };
    
    printf("a(2) returns: %i\n", a(2));
    printf("before b(): %s\n", c);
    b();
    printf("after b(): %s\n", c);
    return 0;
}

Output:

> ./prog
initial: abc
a(2) returns: 14
before b(): def
after b(): ghi

Blocks (1)

Lately I've been working on improving my Objective-C coding skills. Although I like PyObjC a lot, it has some issues. I'm thinking I will post some of my exercises on the blog under Quick Objective-C.

To help focus my efforts, I bought a copy of Aaron Hillegass's book
Objective-C Programming.

I like this book a lot. It's a basic introduction that has lots of C as well, but the material has been finely honed by many hours in the classroom. It doesn't cover GUI stuff very much, that's for his other book, the latest edition of which is by Hillegass & Preble.

In order to keep things simple I am not using Xcode. I just invoke the clang compiler from the command line, usually with a #import <Foundation/Foundation.h>, and I use garbage collection.

The example for today is a very basic block, which comes rather late in the book. But before we do that, here is an example of a function pointer in C. We have, in order, a function that takes a single const char * argument, a typedef for a function pointer for functions of that signature, and a function that takes such a function pointer as an argument.

// clang fptr.c -o prog
#include <stdio.h>

void speak(const char *s) { printf("%s", s); }

typedef void (*fptr)(const char *);

void g(fptr f, const char *s){ f(s); }

int main(int argc, const char *argv[]){
    speak("abc\n");
    fptr f = &speak;
    f("def\n");
    g(f, "ghi\n");
    g(speak, "jkl\n");
    return 0;
}

The output is as expected:

> ./prog
abc
def
ghi
jkl

The typedef allows us to declare variables of type fptr. Since the signature for speak matches, we can assign f to the address of speak, and call it just as we would the original function. Either the pointer (or the original function) can be passed into g.

A basic block is declared and defined in a way similar to this. It turns a function into an object that can be passed around, for example, into functions. Of course, blocks are much more powerful than this (e.g. bbum's post).

Here is a basic block. We go to the trouble of making a typedef for this type of block, and define a function f that takes such a block as an argument.

// clang blocks.m -o prog -framework Foundation -fobjc-gc-only
#include <Foundation/Foundation.h>

typedef void (^B) (const char *s);

void f(B b, const char *s){ b(s); }

int main(int argc, char * argv[]) {
    B b = ^(const char *s) {
        printf("%s", s);
    };
    f(b,"abc\n");
    return 0;
}

> ./prog
abc

The similarity to function pointers is pretty clear. The typedef isn't necessary, of course. And if a block is to be used only once, it might even be anonymous, something like a Python lambda.

No typedef:

retval (^block_var) (named_args) = ^(named_args) { ..code.. };

Anonymous versions:
[obj methodWithArg:arg1 block:^(named_args) { ..code.. }];
[obj methodWithArg:arg1 block:(retval (^)(named_args)) { ..code.. }];

It's worth pointing out that the return type is not required to be specified.

An standard example of an anonymous block is to handle the result of a file save dialog (as described in my post here).

[savePanel beginSheetModalForWindow:[NSApp mainWindow]
                  completionHandler:^(NSInteger result) {
                      if (result == NSOKButton) {
                          [savePanel orderOut:self];
                          [self application:NSApp 
                            saveFile:[savePanel filename]];
                      }
                  }];

I'll show an example of capturing state from the enclosing scope in another post.

Thursday, May 3, 2012

Unicode is weird fun!

I came across something fun in an SO answer the other day:

˙sɯǝlqoɹd ʎuɐ pɐɥ ɹǝʌǝu ǝʌ,I puɐ uʍop ǝpısdn pǝʇunoɯ ǝɹɐ sǝɥɔʇıʍs ʞɹoʍʇǝu ʎɯ ɟo ll∀'

That's pretty entertaining! I pasted it into the Python interpreter:

>>> s = '˙sɯǝlqoɹd ʎuɐ pɐɥ ɹǝʌǝu ǝʌ,I puɐ uʍop ǝpısdn pǝʇunoɯ ǝɹɐ sǝɥɔʇıʍs ʞɹoʍʇǝu ʎɯ ɟo ll∀'
>>> s
'\xcb\x99s\xc9\xaf\xc7\x9dlqo\xc9\xb9d \xca\x8eu\xc9\x90 p\xc9\x90\xc9\xa5 \xc9\xb9\xc7\x9d\xca\x8c\xc7\x9du \xc7\x9d\xca\x8c,I pu\xc9\x90 u\xca\x8dop \xc7\x9dp\xc4\xb1sdn p\xc7\x9d\xca\x87uno\xc9\xaf \xc7\x9d\xc9\xb9\xc9\x90 s\xc7\x9d\xc9\xa5\xc9\x94\xca\x87\xc4\xb1\xca\x8ds \xca\x9e\xc9\xb9o\xca\x8d\xca\x87\xc7\x9du \xca\x8e\xc9\xaf \xc9\x9fo ll\xe2\x88\x80'

According to this, the flipped version of 'a' ('0x61') is '\u0250', in UTF-8 that's '\xc9\x90'. (There are a bunch of other mappings). You can see these two bytes in the hex data right after ' p': '\xc9\x90'.

But it's going to take some 'splainin'.

This is a great place to start. The thing about Unicode is that it is fundamentally just a listing of integer values and the corresponding "code points." Just take all the interesting symbols you can find and organize them and then assign an integer to one after another. That's not hard to understand. What's hard is when you get into representing these integers in multi-byte encodings (e.g. UTF-8) and, of course, in printing them.

We can get some info about this in Python using the unicodedata module.

>>> import unicodedata
>>> i = 592
>>> hex(i)
'0x250'
>>> c = unichr(i)
>>> print c
ɐ
>>> c.encode('utf8')
'\xc9\x90'
>>> '\xc9\x90'.decode('utf8')
u'\u0250'
>>> unicodedata.name(c)
'LATIN SMALL LETTER TURNED A'
>>>

The unicode "value" is 2*16*16 + 5*16 = 512 + 80 = 592.

What makes it complicated is the gyrations involved in representing this in various other formats, like UTF-8 or HTML ('&' + '#592;'). I grabbed the defs for the flipTable from the reference above.

Then I went exploring in Python (see the script for details):

> python script.py 
a
from flipTable:
ɐ    592 \u0250 LATIN SMALL LETTER TURNED A
from unicodedata:
ɐ    (592, '0x250', u'\u0250', 'LATIN SMALL LETTER TURNED A')
ᵄ    (7492, '0x1d44', u'\u1d44', 'MODIFIER LETTER SMALL TURNED A')
..
e
from flipTable:
ǝ    477 \u01DD LATIN SMALL LETTER TURNED E
from unicodedata:
ǝ    (477, '0x1dd', u'\u01dd', 'LATIN SMALL LETTER TURNED E')
ᴈ    (7432, '0x1d08', u'\u1d08', 'LATIN SMALL LETTER TURNED OPEN E')
ᵌ    (7500, '0x1d4c', u'\u1d4c', 'MODIFIER LETTER SMALL TURNED OPEN E')
ⱻ    (11387, '0x2c7b', u'\u2c7b', 'LATIN LETTER SMALL CAPITAL TURNED E')

Many of the "flips" are silly ('q' for 'b' and so on), but lots of them are absolutely right. Finally, my Python doesn't seem to know about anything i >= 2**16.

>>> i = 2**16
>>> i
65536
>>> unichr(i)
Traceback (most recent call last):
  File "", line 1, in 
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> i -= 1
>>> unichr(i)
u'\uffff'
>>> print unichr(i)

>>>
script.py

from collections import defaultdict
import unicodedata
from flip import flipTable

letters = 'abcdefghijklmnopqrstuvwxyz'
L = list(letters)
D = defaultdict(list)

def caps(s):
    L = [c.capitalize() for c in s]
    return ''.join(L)

for i in range(1,2**16):
    u = unichr(i)
    try:
        n = unicodedata.name(u)
    except ValueError:
        continue
    v = n.lower()
    if 'turned' in v:
        k = v.split()[-1]
        if len(k) > 1:
            continue
        if not 'small' in v:
            k = k.upper()
        D[k].append((i,hex(i),u,n))

for s in letters:
    print s
    c = '\u00' + caps(hex(ord(s))[2:])
    try:
        v = flipTable[c]
        i = int('0x' + v[2:],16)
        u = unichr(i)
        print 'from flipTable:'
        print u.ljust(4), i, v, 
        print unicodedata.name(u)
    except KeyError:
        pass
    if D[s]:
        print 'from unicodedata:'
    for e in D[s]:
        print e[2].ljust(4), e
    print '-' * 10

Tuesday, May 1, 2012

gpg

This post is about the Gnu Privacy Guard (gpg). A link from the main page leads to a download of an installer for OS X here.

I followed the Quick start tutorial here, and sent an encrypted email between two of my accounts using Mail after just a few minutes. Nice!

The lock and signature icons in the lower right-hand corner are only active if the recipient's public key is available.

Let's explore a bit in Terminal
generate a new key with a spectacularly weak passphrase: abc

> gpg --gen-key
gpg (GnuPG/MacGPG2) 2.0.18; Copyright (C) 2011 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048)   
Requested keysize is 2048 bits   
Please specify how long the key should be valid.
         0 = key does not expire
        = key expires in n days
      w = key expires in n weeks
      m = key expires in n months
      y = key expires in n years
Key is valid for? (0) 4y
Key expires at Thu Apr 28 10:11:26 2016 EDT
Is this correct? (y/N) y
                        
GnuPG needs to construct a user ID to identify your key.

Real name: Alice
Email address: alice@email.com
Comment:                      
You selected this USER-ID:
    "Alice "

Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.    

We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
gpg: key 6CC18DC3 marked as ultimately trusted
public and secret key created and signed.

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   4  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 4u
gpg: next trustdb check due at 2015-08-18
pub   2048R/6CC18DC3 2012-04-29 [expires: 2016-04-28]
      Key fingerprint = A84B 4F1D A439 849F 47F1  55F9 BF7C FB80 6CC1 8DC3
uid                  Alice 
sub   2048R/DC9C92B3 2012-04-29 [expires: 2016-04-28]

I have a feeling that this fake user's data may have been sent to the key server, and that's not nice. But I expected to get a prompt related to that and I didn't get one.

Let's take a quick look at what we have:

> gpg -k Alice
pub   2048R/6CC18DC3 2012-04-29 [expires: 2016-04-28]
uid                  Alice 
sub   2048R/DC9C92B3 2012-04-29 [expires: 2016-04-28]

> gpg -k --fingerprint Alice
pub   2048R/6CC18DC3 2012-04-29 [expires: 2016-04-28]
      Key fingerprint = A84B 4F1D A439 849F 47F1  55F9 BF7C FB80 6CC1 8DC3
uid                  Alice 
sub   2048R/DC9C92B3 2012-04-29 [expires: 2016-04-28]

>

-a is "ascii armored" base64 output:

> gpg -a --export 6CC18DC3
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG/MacGPG2 v2.0.18 (Darwin)
Comment: GPGTools - http://gpgtools.org

mQENBE+dTEUBCADWaa0ikPdHmp7OONsuhUJeIIr..
..
-----END PGP PUBLIC KEY BLOCK-----
>
> gpg -a --export-secret-key 6CC18DC3
-----BEGIN PGP PRIVATE KEY BLOCK-----
Version: GnuPG/MacGPG2 v2.0.18 (Darwin)
Comment: GPGTools - http://gpgtools.org

lQO+BE+dTEUBCADWaa0ikPdHmp7OONsuhUJeIIr..
..
-----END PGP PRIVATE KEY BLOCK-----

While this looks quite similar to RSA (my posts), the Python rsa module won't handle it as is. I didn't dissect it by hand yet. There's something called pgpdump that is recommended on the web.

It builds easily and then I can do:

gpg --export 6CC18DC3 | pgpdump -i
gpg --export 6CC18DC3-secret-key | pgpdump -i

but I didn't find the output all that useful yet. It shows multiple values for n and e but doesn't show d or p and q.

Try some encryption:

> gpg -e -r 6CC18DC3 m.txt
> hexdump -C m.txt
00000000  68 65 6c 6c 6f 2c 20 77  6f 72 6c 64 21 0a        |hello, world!.|
0000000e
> hexdump -C m.txt.gpg
00000000  85 01 0c 03 a9 47 9a 63  dc 9c 92 b3 01 07 ff 57  |.....G.c.......W|
00000010  86 9b 5d a0 40 d1 f0 ef  5a 6f dc eb 19 a9 eb 8c  |..].@...Zo......|
00000020  ee 66 a7 34 84 e4 47 5b  6c 48 9f 9e 89 13 4c 2a  |.f.4..G[lH....L*|
00000030  71 6a 31 b7 27 23 9d 56  a7 c2 ad fd db 47 57 68  |qj1.'#.V.....GWh|
00000040  da 75 9a 2d 2f f6 46 60  16 84 b6 17 bf e7 b7 5c  |.u.-/.F`.......\|
00000050  36 fd d1 e2 22 ee 93 dc  ad 82 f5 f1 46 99 12 f3  |6...".......F...|
00000060  fe 25 a1 b3 01 8c 37 a0  59 da ac 39 90 a4 1c ba  |.%....7.Y..9....|
00000070  a0 4f 1e b6 da d5 36 55  b1 17 d6 c4 5a 28 de b4  |.O....6U....Z(..|
00000080  47 b2 af 8a c8 9c 58 85  44 f8 08 fe a1 47 c3 8f  |G.....X.D....G..|
00000090  4d b1 78 50 87 dc a7 7f  55 89 f2 6e 7f 75 ae a0  |M.xP....U..n.u..|
000000a0  69 68 46 5a 64 1e b4 6e  c7 ee 84 77 8d a4 ce 14  |ihFZd..n...w....|
000000b0  72 45 13 be d0 33 5c d6  23 6f 2d b2 84 2f d9 55  |rE...3\.#o-../.U|
000000c0  f7 de d2 8f b6 20 5b 71  4e 31 ae b8 d7 1b 09 bf  |..... [qN1......|
000000d0  80 9e e0 1f 47 cb 73 a1  59 42 81 24 1f 2b de 4b  |....G.s.YB.$.+.K|
000000e0  0d 23 fc c6 a2 83 5e c2  b3 e5 9f 1f 32 ae 75 07  |.#....^.....2.u.|
000000f0  79 7f 51 49 02 80 a8 47  c4 5c b6 6f aa ac d4 5c  |y.QI...G.\.o...\|
00000100  e7 c9 b6 1f d2 c1 7e 03  45 34 59 85 d1 63 01 d2  |......~.E4Y..c..|
00000110  4e 01 27 2e e9 09 aa 82  5d 77 56 82 22 4e 2e 67  |N.'.....]wV."N.g|
00000120  1c 4a bc ba c1 43 d6 f0  86 02 5d e7 b3 58 74 79  |.J...C....]..Xty|
00000130  bc 15 69 d4 44 ba f6 76  0c a7 a1 d5 9b 1b e0 8b  |..i.D..v........|
00000140  b6 7b df db b0 5f e6 34  0b 36 14 0b fd c6 62 f3  |.{..._.4.6....b.|
00000150  16 a8 97 ad 92 e7 4e a4  ee ab 59 53 91 c6 52     |......N...YS..R|
0000015f

> gpg -d -o p.txt m.txt.gpg

You need a passphrase to unlock the secret key for
user: "Alice "
2048-bit RSA key, ID DC9C92B3, created 2012-04-29 (main key ID 6CC18DC3)

gpg: encrypted with 2048-bit RSA key, ID DC9C92B3, created 2012-04-29
      "Alice "

> cat p.txt
hello, world!

I found what looks to be an excellent tutorial on the web. I still need to work through it.

pycrypto

Here's a Quick Python post about encryption using the pycrypto module. Much of it is derived from the excellent overview here.

I tried easy_install first, but it hung for some reason, so I just did:

git clone https://github.com/dlitz/pycrypto.git
cd pycrypto
python setup.py build
sudo python setup.py install

Let's try a little DES:

>>> from Crypto.Cipher import DES
>>> des = DES.new('01234567',DES.MODE_ECB)
>>> text = 'hello, world'
>>> c = des.encrypt(text)
Traceback (most recent call last):
  File "", line 1, in 
ValueError: Input strings must be a multiple of 8 in length
>>> len(text)
12
>>> text += '.' * (8 - len(text) % 8)
>>> c = des.encrypt(text)
>>> c
'\xc4\xf56\xad\xbb\xae\xb8\x84\xbf\xb3\xfaH^\xe3\x10\x16'
>>> des.decrypt(c)
'hello, world....'
>>>

Here is a different mode of DES called CFB (Cipher Feedback). It requires some random bytes, and two instantiations of the object:

>>> from Crypto.Cipher import DES
>>> from Crypto import Random
>>> iv = Random.get_random_bytes(8)
>>> des1 = DES.new('01234567', DES.MODE_CFB, iv)
>>> des2 = DES.new('01234567', DES.MODE_CFB, iv)
>>> text = 'hello, world!'
>>> c = des1.encrypt(text)
>>> c
'\x92\xc6\xc7K=\xb0\xf4\x83A\xfd\xa4\x13e'
>>> des2.decrypt(c)
'hello, world!'

The venerable xor method is also present:

>>> from Crypto.Cipher import XOR
>>> key = '\x00'*4 + '\x01'*4 + '\x10'*4 + '\x11'*4
>>> xor = XOR.new(key)
>>> text = '\x00\x01\x10\x11' * 4
>>> c = xor.encrypt(text)
>>> p = xor.decrypt(c)
>>> text
'\x00\x01\x10\x11\x00\x01\x10\x11\x00\x01\x10\x11\x00\x01\x10\x11'
>>> key
'\x00\x00\x00\x00\x01\x01\x01\x01\x10\x10\x10\x10\x11\x11\x11\x11'
>>> c
'\x00\x01\x10\x11\x01\x00\x11\x10\x10\x11\x00\x01\x11\x10\x01\x00'
>>> p
'\x00\x01\x10\x11\x00\x01\x10\x11\x00\x01\x10\x11\x00\x01\x10\x11'

I adapted one of the scripts from Laurent Luce's post to do encryption on a file. Here is the output (the hex values of the key we generated), and some display of the data using hexdump:

> python script.py 
21 31 e4 25 b1 ab be fb 3d 35 95 f7 8b 67 ba 24
> hexdump -C m.txt
00000000  48 65 6c 6c 6f 2c 20 77  6f 72 6c 64 21 0a        |Hello, world!.|
0000000e
> hexdump -C c.txt
00000000  25 bc 08 25 ae 7c fb c3  3a a3 91 83 6a 34 7c 06  |%..%.|..:...j4|.|
00000010
> hexdump -C p.txt
00000000  48 65 6c 6c 6f 2c 20 77  6f 72 6c 64 21 0a 20 20  |Hello, world!.  |
00000010

script.py

from Crypto import Random
from Crypto.Cipher import DES3
import struct

def encrypt_file(ifn, ofn, chunk_size, key, iv):
    des3 = DES3.new(key, DES3.MODE_CFB, iv)
    with open(ifn, 'r') as in_file:
        with open(ofn, 'w') as out_file:
            while True:
                chunk = in_file.read(chunk_size)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)
                out_file.write(des3.encrypt(chunk))
 
def decrypt_file(ifn, ofn, chunk_size, key, iv):
    des3 = DES3.new(key, DES3.MODE_CFB, iv)
    with open(ifn, 'r') as in_file:
        with open(ofn, 'w') as out_file:
            while True:
                chunk = in_file.read(chunk_size)
                if len(chunk) == 0:
                    break
                out_file.write(des3.decrypt(chunk))


SZ = 100
ifn = 'm.txt'
ofn = 'c.txt'
key = Random.get_random_bytes(16)
iv = Random.get_random_bytes(8)

L = [struct.unpack('B',k) for k in key]
L = [hex(t[0])[2:] for t in L]
print ' '.join(L)

encrypt_file(ifn, ofn, SZ, key, iv)
decrypt_file(ofn, 'p.txt', SZ, key, iv)

bytearray

I came across a post by Python guru David Beazley introducing the bytearray data type. Here is one of his examples, modified slightly:

>>> s = "Hello World"
>>> b = bytearray(s)
>>> b
bytearray(b'Hello World')
>>> b[:5] = "Cruel"
>>> b
bytearray(b'Cruel World')
>>> for i in b:
...     print i
... 
67
114
117
101
108
32
87
111
114
108
100
>>> b.append(33)
>>> b
bytearray(b'Cruel World!')
>>> print b
Cruel World!

The usage b"mystring" was new to me as well. According to this SO answer:

the b prefix for string literals is ineffective in 2.6, but it is a useful marker in the program, which flags explicitly the intent of the programmer to have the string as a data string rather than a text string. This info can then be used by the 2to3 converter or similar utilities when the program is ported to Py3k.

More info here (thanks, Ned).

A bytearray has the methods of lists including __getitem__ and __setitem__. Unusually for Python, one can use arguments of both int and str types. The actual type of a bytearray item is < type 'int'>.

>>> b = bytearray('Hello World!')
>>> b[-1] = '*'
>>> b
bytearray(b'Hello World*')
>>> ord('!')
33
>>> b[-1] = 33
>>> b
bytearray(b'Hello World!')

We can use a bytearray to do a small job of encryption in a very convenient way:

>>> import numpy as np
>>> msg = bytearray("Hello World")
>>> n = len(msg)
>>> L = np.random.randint(0,256,n)
>>> key = bytearray(list(L))
>>> len(key)
11
>>> key
bytearray(b'N\x963\x8a\x06\xf2\xe9\x92\xb9\xf4L')
>>>
>>>
>>> ctext = bytearray()
>>> for c,k in zip(msg,key):
...     ctext.append(c ^ k)
... 
>>> ctext
bytearray(b'\x06\xf3_\xe6i\xd2\xbe\xfd\xcb\x98(')
>>> ptext = bytearray()
>>> for c,k in zip(ctext,key):
...     ptext.append(c ^ k)
... 
>>> ptext
bytearray(b'Hello World')

Monday, April 30, 2012

KS test (2)

A bit more about the Kolmogorov-Smirnov test after the first post here. I found a pdf at MIT ocw about this. I don't understand much of it but it's clear that using the absolute value of the difference to get the statistic is correct.

Now we need to figure out the distribution of the statistic under the null hypothesis, so we can set our type I error at 0.05. From my perspective, the way to do this is to do a simulation in Python. The code below does the following N times: grab n=20 numbers from the standard normal and sort them, compute the D statistic as we did in the previous post, and accumulate those values in a list.

We sort the list, then find the place in the list where the statistic for the test case is found. (I used a function from the bisect module for this, but the naive method gives the same result).

I get a p-value of 0.25. This does not match what we had before (0.36), and I am not sure why that is.

To do a little more checking, I found an online reference that gives p-values for various statistic,n combinations.

for n = 20

p-value   D
0.20   0.210
0.15   0.246
0.10   0.264
0.05   0.294
0.01   0.356

According to this, a statistic of 0.20 (what we had) gives a p-value a bit larger than 0.20, but not much. So this source doesn't agree with what we got from R and Python. Maybe the online source is for a one-sided test (it doesn't state clearly), that would give a smaller p-value for a given statistic.

I added calculations for our simulation of d given various p, and vice-versa. The output:

> python script.py
d = 0.199 p = 0.25
d = 0.246 p = 0.10
d = 0.264 p = 0.06
d = 0.294 p = 0.03

p = 0.15 d = 0.226
p = 0.10 d = 0.245
p = 0.05 d = 0.275
p = 0.01 d = 0.336

Not sure what the correct answer is here, but I can't find a problem with the code and I feel like it's time to move on. One last bit is something to think about that is quoted here:

The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple, and doesn't do a good job of discriminating whether or not your data was sampled from a Gaussian distribution. An expert on normality tests, R.B. D’Agostino, makes a very strong statement: “The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used.” (“Tests for Normal Distribution” in Goodness-of-fit Techniques, Marcel Decker, 1986).

Take that, Andrey!

from bisect import bisect_left as bl
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
    
# normal cdf
rv = norm()

def calc_stat(xL,yL):
    xL.sort()
    dL = list()
    for x,y0 in zip(xL,yL):
        y1 = rv.cdf(x)
        dL.append(abs(y1-y0))
    return max(dL)

def p_value(d,L):
    #for i,e in enumerate(L):
        #if d < e:
            #break
    i = bl(L,d)
    p = 1 - ((i+1)*1.0/len(L))
    return p
#-----------------------------------
# simulation
N = 10000
n = 20
yL = range(1,n+1)
yL = [1.0*y/n for y in yL]  #replaces "steps"
pL = list()

for i in range(N):
    xL = np.random.normal(0,1,n)
    D = calc_stat(xL,yL)
    pL.append(D)

plt.hist(pL,bins=50)
plt.savefig('example.png')
#-----------------------------------
pL.sort()
for d in [0.199, 0.246, 0.264, 0.294]:
    p = p_value(d,pL)
    print 'd = %3.3f' % d,
    print 'p = %3.2f' % p

print
for p in [0.15, 0.10, 0.05, 0.01]:
    i = int(N*(1-p))
    d = pL[i]
    print 'p = %3.2f' % p,
    print 'd = %3.3f' % d

KS test

I ran into a question about using the Kolmogorov-Smirnov test for goodness-of-fit in R and Python. I'd heard of the KS test but didn't know what it was about so I decided to explore.

Wikipedia's explanation was opaque at first (not surprising), but I found nice ones here and here.

As the references say, the KS test can be used to decide if a sample comes from a population with a specific distribution. We calculate a statistic D, and if the statistic is large enough we reject the null hypothesis that the sample comes from the specified distribution.

What is the D statistic? I think it is just the maximum deviation between the observed cdf for the sample, compared to the distribution. That works for the script below.

The KS test is a non-parametric test, it doesn't depend on assumptions about the distribution (e.g. standard deviation) of the data.

For these tests, I needed to reinstall scipy. The previous difficulties were not observed this time. It was as simple as:

git clone git://github.com/scipy/scipy
cd scipy
python setup.py build
sudo python setup.py install

The script below grabs 20 numbers from the normal distribution (sd = 2.0) and compares them with the standard normal distribution (sd = 1). The script does a plot (screenshot above) and prints the results of the KS test:

> python script.py 
max 0.198
D = 0.20, p-value = 0.36

We are not able to reject the null hypothesis, even though the source of the numbers had sd=2 and the null hypothesis is that sd=1.

I did a quick check by copying the numbers

x=c(-5.05,-2.17,-1.66,-1.35,-0.98,-0.97,-0.84,-0.74,-0.67,-0.00,0.21,0.24,0.33,0.67,0.94,1.63,1.65,2.05,2.22,3.19)

and pasting them into R. Then:

> ks.test(x,pnorm,0,1)

 One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.1986, p-value = 0.3611
alternative hypothesis: two-sided

Which matches. I also wrote a slightly hackish test of the test in R:

f <- function () {
 c = 0 
 for (i in 1:100) {
 y1 = rnorm(n,mean=0,sd=1)
 result = ks.test(y1,"pnorm",0,1)
 if (result$p.value < 0.05) {
  print (result$statistic)
  print (result$p.value)
  }
 }
}

This function spews out the values of the parameters if the p-value is < 0.05. We're testing against the same distribution from which the samples were drawn (i.e. the null hypothesis is correct), so we expect that about 5 of the 100 tries will have a test statistic so extreme that we're fooled into thinking the null hypothesis is wrong. Output:

> n = 10
> f()
       D 
0.454372 
[1] 0.020945
        D 
0.4113619 
[1] 0.04811348
        D 
0.4722679 
[1] 0.01440302
      D 
0.43666 
[1] 0.02984022
        D 
0.5340613 
[1] 0.003428069
        D 
0.4336947 
[1] 0.03161219

Looks good to me.

script.py

import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, kstest
import numpy as np

np.random.seed(1773)

def steps(L):
    delta = 1.0/len(L)
    y = 0
    rL = list()
    for x in L:
        y += delta
        rL.append(y)
    return rL

xL = np.random.normal(0,2.0,20)
xL.sort()
yL = steps(xL)

#----------------------------------

# points
plt.scatter(xL, yL, s=50, zorder=1)

# distribution
rv = norm()
pL = np.linspace(-3, 3)
plt.plot(pL, rv.cdf(pL), '--')

# vertical lines
dL = list()
for x,y0 in zip(xL,yL):
    y1 = rv.cdf(x)
    dL.append(abs(y1-y0))
    plt.plot((x,x),(y0,y1),
        color='r', zorder=0)

ax = plt.axes()
ax.set_xlim(-4, 4)
ax.set_ylim(-0.1, 1.1)
plt.savefig('example.png')

print 'max %3.3f' % max(dL)
t = kstest(xL,'norm')
print 'D = %3.2f, p-value = %3.2f' % t

xL = ['%3.2f' % n for n in xL]
print ','.join(xL)