## Wednesday, December 14, 2011

### Getting a little help from Newton

I've posted here on a number of topics and derivations in fundamental calculus. This is about finding the derivative for powers of x with negative or fractional exponents. It seems a trivial proof, until I got stopped in my tracks by the cofactors of the binomial distribution.

The idea is to find the slope of the curve at x by adding a little bit to x, (x + h), calculating f(x + h), then subtract f(x), and finally divide by h and finding the limit as h gets very small. It works great for y = x2. But consider

 `y = x1/2`

We want to do:

 `1/h [(x + h)1/2 - x1/2]`

So thinking about the binomial expansion of the x + h term, I'm trying to figure what is the expansion for the 1/2 power? I mean, what can "n choose k" mean when n is 1/2?

The other thing that is a good approach here is to bring x out of the sum as follows

 ```(x + h)1/2 = x1/2(1 + h/x)1/2```

To expand this, we just remember Newton's super-duper binomial expansion from here. If the expression is

 `(1 + Q)m/n`

the expansion is

 `1 + (m/n)Q + (m/n)((m-n)/2n)Q2 + (m/n)((m-n)/2n)((m-2n)/3n)Q3 + .. `

We only need the first two terms because Q2 is (h/x)2 which we will ignore. So we have for the expansion

 ```(1 + h/x)1/2 = 1 + 1/2(h/x)```

and the derivative is the limit as h => 0 of

 ```1/h [x1/2(1 + 1/2(h/x)) - x1/2] = 1/h (1/2 h) 1/x1/2 = 1/2 x-1/2```

Similarly for m = -2 and n = 1 the first two terms are 1 and -2 and the expansion we need will be

 ```(1 + (h/x))-2 = = 1 - 2(h/x)```

The derivative is the limit as h => 0 of

 ```1/h [x-2 (1 - 2(h/x)) - x-2] = = 1/h x-2 (-2)(h/x) = -2 x-3```

Easy when you know how.

### Basic Python for Bioinformatics

I've been silent online for the last few months, due to other commitments associated with my day job. But I haven't forgotten all of you in the ether.

Just before I got busy, I wrote a book about basic programming with Python using lots of biological examples. It's been proofed once or twice, but is not quite a beta version, so let's call it the gamma release. I want to push it out now, even it could certainly use some more polish, because it's not clear when I'll get the time to do that.

The target audience is the novice programmer who is interested in biology. However, it differs from most similar material because there are lots of biological and sequence analysis examples, collected from various sources including posts on this site.

Here is a screenshot of the toc for the middle section of the book.

The book is in html format (link to Dropbox zip). It was built using the Sphinx software. One can generate a pdf version of the book as well, but it suffers from the major limitation that cut-and-paste of the code loses the indentation. The version linked to here has code examples that can be pasted directly into a text file and executed, and often will work in the interpreter as well.

I hope you like it. I would be very grateful for comments, criticisms, corrections, and even complaints. Post 'em as comments here, or find my work email through the aboutme. Enjoy!

## Wednesday, August 31, 2011

### Binary multiplication, again

I had a lot of fun this afternoon, but all along I knew I was "reinventing the wheel." It's easy when you know where you're going.

I implemented binary multiplication with a class that keeps its data as Python strings like '00100011'. It's probably because I have been re-reading Charles Petzold's book Code, which I think is a classic. In any event, I wrote the class "word" which allows objects to add, multiply and exponentiate themselves. The second file is test.py which exercises the class.

It's a great "simple Python" coding project. We do bit-shifting by hand. If you want to be slick, you could go further by trying to implement fast exponentiation. Here is another example by a guy who obviously knows what he's doing.

A bit of sample output:

 ```> python test.py x 467 y 327 x + y 00000000 00000000 00000011 00011010 = 794 check 467 + 327 = 794 x * y 00000000 00000010 01010100 10000101 = 152709 check 467 * 327 = 152709 z 3 x**z 00000110 00010010 00010010 00001011 = 101847563 check 467 ** 3 = 101847563 x 995 y 402 x + y 00000000 00000000 00000101 01110101 = 1397 check 995 + 402 = 1397 x * y 00000000 00000110 00011010 01110110 = 399990 check 995 * 402 = 399990 z 3 x**z 00111010 10110111 00001100 10111011 = 985074875 check 995 ** 3 = 985074875 x 897 y 490 x + y 00000000 00000000 00000101 01101011 = 1387 check 897 + 490 = 1387 x * y 00000000 00000110 10110100 11101010 = 439530 check 897 * 490 = 439530 z 4 x**z 10111011 11001001 10001110 00000001 = 3150548481 ran into a slight problem```

`word.py`

 ```class word: SZ = 32 def __init__(self,n): if type(n) == type('a'): b = n.rjust(word.SZ,'0') elif type(n) == type(1): b = bin(n)[2:] else: raise ValueError, "can't do that" self.b = b.rjust(word.SZ,'0') def __repr__(self): b = self.b s = ' '.join([b[:8], b[8:16], b[16:24], b[24:]]) s += ' = ' + str(eval('0b' + self.b)) return s def __add__(self, a): carry = '0' rL = list() for i in range(len(self.b)-1,-1,-1): x = self.b[i] y = a.b[i] t = x+y if (t == '01' or t == '10'): if carry == '0': rL.append('1') if carry == '1': rL.append('0') elif t == '00': rL.append(carry) carry = '0' else: assert t == '11' if carry == '0': rL.append('0') carry = '1' else: rL.append('1') carry = '1' if carry == '1': raise ValueError, 'overflow' rL.reverse() return word(''.join(rL)) def __mul__(self, a): L = list() for i in range(len(self.b)-1,-1,-1): x = self.b[i] if not x == '1': continue n = word.SZ - i - 1 r = word(a.b[n:] + '0'*(n)) L.append(r) res = L.pop(0) #print 'start: ', res while L: next = L.pop(0) #print 'next: ', next res = res + next #if L: #print 'intermediate:', #else: #print 'result: ', #print res return res def __pow__(self, n): res = self for i in range(n-1): res = res * self return res```

`test.py`

 ```import random from word import word R = range(1000) N = 10 for i in range(N): x = random.choice(R) y = random.choice(R) xw = word(x) print 'x ', x yw = word(y) print 'y ', y print 'x + y', r = xw + yw print r S = x+y assert S == int(str(r).split()[-1]) print 'check', x, '+', y, '=', S print 'x * y', r = xw * yw print r P = x*y assert P == int(str(r).split()[-1]) print 'check', x, '*', y, '=', P for i in range(2,N*2): try: r = xw**i except ValueError: i = i-1 break z = i print 'z ', z print 'x**z ', r E = x**z try: assert E == int(str(r).split()[-1]) except AssertionError: print 'ran into a slight problem\n' continue print 'check', x,'**', z, '=', E print```

### Dissecting RSA keys in Python (4)

Still working with RSA keys. I accomplished the most important of the items left undone so far, to use the keys to encrypt a message using my own code.

In the process, I learned a new simple Python fact: the `pow` function (which is now a built-in), can take a modulus as a third argument.

And, it works where this doesn't seem to: (x**y) % z

(I discovered this by digging into the rsa module source).

 ```python -m timeit -s "import rsa; \ f = open('id_rsa'); data = f.read(); f.close(); k = rsa.PrivateKey.load_pkcs1(data)"\ "41330915578951772302369**k.e % k.n" 10000 loops, best of 3: 79.4 usec per loop python -m timeit -s "import rsa; \ f = open('id_rsa'); data = f.read(); f.close(); k = rsa.PrivateKey.load_pkcs1(data)"\ "pow(41330915578951772302369, k.e, k.n)" 10000 loops, best of 3: 70.2 usec per loop```

In the example here, both work and have about the same timing. But in the code below, the first version hangs when doing decryption (with a large base). Here's the output, followed by the script.

 ```> python encode.py m: Hello, secret world! p: .xyz.Hello, secret world!.xyz. a: 413309155789517723023698766343791993289928631329 c: 11434905702482726455415220687715293368262190253795 .. i: 413309155789517723023698766343791993289928631329 r: Hello, secret world!```

`encode.py`

 ```import rsa with open('id_rsa') as f: data = f.read() k = rsa.PrivateKey.load_pkcs1(data) n = k.n e = k.e d = k.d def my_atoi(s): L = [ord(c) for c in s] k = 256 iL = L[:] iL.reverse() x = iL[0] for i in iL[1:]: x += i*k k *= 256 return x def my_itoa(i): rL = list() while i: rL.append(i%256) i = i/256 rL.reverse() return ''.join([chr(n) for n in rL]) def encrypt(m): return m**e % n def decrypt(c): # note: c**d % n fails return pow(c,d,n) if __name__ == '__main__': m = 'Hello, secret world!' pad = '.xyz.' p = pad + m + pad a = my_atoi(m) c = encrypt(a) i = decrypt(c) r = my_itoa(i) r = r.replace(pad,'') L = zip('mpacir',[m,p,a,c,i,r]) N = 50 for varname, var in L: s = str(var) print varname + ': ', s[:N], if len(s) > N: print '..' else: print```

### Dissecting RSA keys in Python (3)

I'm working on understanding the structure of RSA keys as output by the ssh-keygen utility (previous posts here and here). At the risk of becoming redundant, I want to try to simplify what we did previously. Let's begin by rehashing old material on bytes in Python:

It's easy to get confused between a byte and its string representation, or for that matter, between an unsigned int and its string representation. Although hexadecimal is natural too, it's perhaps easiest to think of an individual byte in terms of the decimal equivalent (0..255). That's because a Python int converts easily to chr, bin, and hex.

 ```>>> i = 35 >>> c = chr(i) >>> c '#' >>> i == ord(c) True >>> b = bin(i) >>> b '0b100011' >>> h = hex(i) >>> h '0x23'```

Notice that hex and bin both give unpadded output:

 ```>>> hex(1) '0x1' >>> bin(1) '0b1'```

Don't be fooled. h and b here are 'str' datatypes. But these string representations of bin and hex values go back to int easily (though we must specify the old base explicitly):

 ```>>> int(b,2) 35 >>> int(h,16) 35```

To interconvert hex and bin, it's easiest to go through int:

 ```>>> bin(35) '0b100011' >>> hex(35) '0x23' >>> hex(int(bin(35),2)) '0x23' >>> bin(int(hex(35),16)) '0b100011'```

The struct module

This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.

`unpack` works on the particular string representation of a byte in which it is a single character:

 ```>>> i = 35 >>> c = chr(i) >>> c '#' >>> unpack('B',c) (35,) >>> b = bin(i) >>> unpack('B',b) Traceback (most recent call last): File "", line 1, in struct.error: unpack requires a string argument of length 1 >>>```

To construct a multi-byte value like a 4-byte unsigned int we can do this:

 ```>>> h = '\x23' >>> type(h) >>> b_little = '\x23\x00\x00\x00' >>> type(b_little) ```

With multi-byte objects (like ints) we get into endian-ness---'I' is the format for an unsigned int, and '>' means big-endian:

 ```>>> unpack('I', b_little) (35,) >>> b_big = '\x00\x00\x00\x23' >>> unpack('I', b_big) (587202560,) >>> unpack('>I', b_big) (35,)```

One can also use the binascii module, but this is more complicated. Let's write two bytes to a file and then read the binary data:

 ```>>> fn = 'temp' >>> FH = open(fn,'w') >>> FH.write('##') >>> FH.close() >>> FH = open(fn,'rb') >>> data = FH.read() >>> FH.close()```

This "data" is still a string:

 ```>>> type(data) >>> data '##' >>> len(data) 2 >>> b2a_hex(data[0]) '23' >>> b2a_hex(data) '2323' >>> b2a_hex('#Az') '23417a'```

Or you can get fancy, but it seems unnecessary..

 ```>>> import array >>> L = array.array('B','#Az') >>> type(L) >>> L[0] 35 >>> L[1] 65 >>> L array('B', [35, 65, 122]) >>> b2a_hex(L) '23417a'```

With all this in mind, we can redo the script from yesterday to read the base64-encoded key data in a way that is simply understandable. The output first:

 ```> python read.py dlen 7 dlen 1 dlen 129 209,8,39,27 .. 169,97,145,127 0xd1,0x8,0x27,0x1b .. 0xa9,0x61,0x91,0x7f 1467871546 .. 1722964351```

We decode the data and break it into 3 parts exactly as before. Looking at the third part, we unpack it byte by byte. The result of unpacking is a list of ints. We convert that list directly into the large number n (by doing b * 256**i for each byte, moving along the list in reverse order), or we can convert the ints into hex values, assemble that list into one long hex string, and then call `eval` as before. It seems perfectly transparent now.

It even seems clear (in retrospect) why there is an additional null byte after the int that specifies the size of the third segment. It is probably to align the base64 encoding to begin with the first data byte.

The only remaining difficulty is that this approach fails for the private key. So that's still a mystery.

`read.py`
 ```import base64 from struct import unpack FH = open('data.txt','r') data = FH.read() FH.close() b64_data = ''.join(data.strip().split()) data = base64.b64decode(b64_data) L = list() while data: dlen = unpack('>I',data[:4])[0] L.append(data[4:dlen+4]) data = data[dlen+4:] print 'dlen', dlen print # let's look at n bL = L[2] bL = bL[1:] # extra null first, why? iL = [unpack('B',b)[0] for b in bL] hL = [hex(i) for i in iL] # look at the int and hex values pL = [str(n) for n in iL] print ','.join(pL[:4]), '..', ','.join(pL[-4:]) print ','.join(hL[:4]), '..', print ','.join(hL[-4:]), '\n' # first just do the computation ourselves m = 256 iL.reverse() n = iL[0] for i in iL[1:]: n += i*m m *= 256 s = str(n) print s[:10], '..', s[-10:] # hex doesn't pad the output # remove the '0x' and add the extra '0' if needed for i in range(len(hL)): b = hL[i] if len(b) < 4: hL[i] = '0' + b[-1] else: hL[i] = b[-2:] h = '0x' + ''.join(hL) nh = eval(h) assert n == nh```

## Tuesday, August 30, 2011

### Dissecting RSA keys in Python (2)

The public key from last time was given (in `id_rsa.pub`) as:

 ```AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosx pbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvF V4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k /dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5j voJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8=```

(It has been broken into 40 character lines). If we look at the first 32 characters of this base64 encoding:

 ```AAAAB3NzaC1yc2EA AAABIwAAAIEA0Qgn```

the decoded version of this in hex is:

 ```0x0 0x0 0x0 0x7 0x73 0x73 0x68 0x2d 0x72 0x73 0x61 0x0 0x0 0x0 0x1 0x23 0x0 0x0 0x0 0x81 0x0 0xd1 0x8 0x27```

The first four bytes are

`0x0 0x0 0x0 0x7`

interpreted as a 32-bit unsigned int that is equal to 7, and indicates that the next 7 bytes are grouped together to form this element of the data (which is 'rsa-ssh', as I mentioned before).

 ```0x73 0x73 0x68 0x2d 0x72 0x73 0x61 s s h - r s a```

The next 4 bytes after that are:

`0x0 0x0 0x0 0x1`

indicating the following block is a single byte of data: `0x23`. In ASCII-encoding this would be the character '#', but it's actually an unsigned int:

 ```>>> int('0x00000023', 16) 35```

This is the value of the exponent for the key pair. More on this in a moment. Finally we get to

`0x0 0x0 0x0 0x81`

 ```>>> int('0x00000081', 16) 129```

It turns out there are exactly 129 bytes left in the data. In the code from the first post (from here), we used the struct module to unpack these three segments of data (we labeled them "parts").

According to the docs, the struct module can be used to interpret strings as packed binary data.

The first argument to unpack is a format string. The first character of the format string can be used to indicate the byte order, size and alignment of the packed data

The '>' indicates big-endian data, data laid out in registers in the same order as it is in memory. While Intel x86 (and current Macs) are little-endian, network order is big-endian and (I suppose) the SSH standard respects this. The next format character 'I' indicates the data following are unsigned ints.

 ```import struct hL = ['0x0','0x0','0x0','0x81'] iL = [int(h,16) for h in hL] cdata = ''.join([chr(n) for n in iL]) res = struct.unpack('>I', cdata)```

 ```>>> print res (129,)```

It returns a tuple and we want the first value.

In the code from yesterday, we simply copied the correct number of bytes in each segment (as "characters"---not ASCII).

In the last part of this example, we need to transform those bytes into numerical data. We'll use struct.unpack again, except this time specifying binary data. For the exponent we had a single byte '\x23'', corresponding to character '#' which is 35 as an unsigned int:

 ```>>> struct.unpack('B','#') (35,)```

Our function f passed the first element of this tuple back to a big list comprehension.

`e = eval('0x' + ''.join(['%02X' % f(x) for x in parts[1]]))`

In the case of the exponent (e) it operated on only a single value, so I'll rewrite it to clarify what this does. Remembering that f(x) will return 35, we'll get rid of the list stuff and just do:

 ```>>> s = '%02X' % 35 >>> s '23' >>> e = eval('0x' + s) >>> >>> e 35```

(I'm not too swift with format strings, but you can read about them here).

And it doesn't do much in this case! But suppose we had two of these bytes in a row. Then:

 ```>>> s = ('%02X' % 35) * 2 >>> s '2323' >>> eval('0x' + s) 8995 >>> 35*256 + 35 8995```

`eval` will turn our string of bytes into a multi-byte unsigned int. This is news to me. So, if we actually had the string of bytes in hex divided into two sets of bytes, those up to and including '0x81'

 `['0x0', '0x0', '0x0', '0x7', '0x73', '0x73', '0x68', '0x2d', '0x72', '0x73', '0x61', '0x0', '0x0', '0x0', '0x1', '0x23', '0x0', '0x0', '0x0', '0x81']`

and those after:

 ```L = ['0x0', '0xd1', '0x8', '0x27', '0x1b', '0xe2', '0xc1', '0x42', '0x8b', '0x31', '0xa5', '0xb9', '0x9', '0x21', '0x4d', '0x1e', '0x59', '0x5f', '0x59', '0x7f', '0xb2', '0xfa', '0xdd', '0xfc', '0xb7', '0x5', '0x7a', '0x7a', '0x4f', '0xd4', '0x50', '0x1f', '0xce', '0x32', '0x21', '0x2b', '0x3c', '0x48', '0xcb', '0xc5', '0x57', '0x82', '0x75', '0x35', '0x82', '0x50', '0xb4', '0xda', '0x46', '0xa2', '0x93', '0x45', '0x96', '0xfc', '0xec', '0xb8', '0x59', '0xe7', '0xc5', '0xa4', '0x5a', '0x33', '0xcb', '0xab', '0x78', '0x52', '0x10', '0x55', '0xaf', '0x24', '0xfd', '0xd4', '0x73', '0xbf', '0x7a', '0xfb', '0x43', '0x22', '0x2e', '0xa0', '0x6c', '0x8d', '0x40', '0xb0', '0xe4', '0xdc', '0xaf', '0x13', '0xb1', '0xef', '0xf9', '0xe5', '0xf8', '0xee', '0x23', '0x12', '0x45', '0x11', '0xfe', '0x63', '0xbe', '0x82', '0x6c', '0xdd', '0x9f', '0xe3', '0x61', '0xd3', '0xc7', '0xad', '0xe8', '0x3c', '0xc3', '0x33', '0x42', '0x7', '0xfb', '0x8c', '0x91', '0xb0', '0x9c', '0x71', '0x8b', '0x15', '0xb7', '0xa9', '0x61', '0x91', '0x7f', '0x0']```

Note: The extra byte at the end here is caused by our having decoded by hand.

Paste it into the interpreter. Then, they have to be padded out...

 ```for i in range(len(L)): s = L[i] c = s[-1] if len(s) < 4: L[i] = '0' + c else: L[i] = s[-2:]```
Remove the first and last bytes and turn it into one long hex value:
 ```>>> s = '0x' + ''.join(L[1:-1]) >>> eval(s) 146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351L```

That matches what we got yesterday, and what the rsa module gave us for n from the private key.

This use of `eval` seems to be widely known, but I would never have appreciated it from just reading the docs.

Here is a quick script that shows all the values for each set of four characters in the base64 encoding, first 9 bytes of output first:

 ```> python script2.py ['A', 'A', 'A', 'A'] [0, 0, 0, 0] ['0b0', '0b0', '0b0', '0b0'] ['000000', '000000', '000000', '000000'] ['00000000', '00000000', '00000000'] 0x0 0x0 0x0 0 0 0 ['B', '3', 'N', 'z'] [1, 55, 13, 51] ['0b1', '0b110111', '0b1101', '0b110011'] ['000001', '110111', '001101', '110011'] ['00000111', '01110011', '01110011'] 0x7 0x73 0x73 7 115 115 ['a', 'C', '1', 'y'] [26, 2, 53, 50] ['0b11010', '0b10', '0b110101', '0b110010'] ['011010', '000010', '110101', '110010'] ['01101000', '00101101', '01110010'] 0x68 0x2d 0x72 104 45 114```

 ```from string import * v = True cL = list(uppercase + lowercase + digits + '+/') D = dict(zip(cL,range(len(cL)))) D['='] = 0 key = '''AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosxpbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvFV4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5jvoJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8=''' L = list(key) while L: s = L[:4] L = L[4:] if v: print s dL = [D[c] for c in s] if v: print dL dL = [bin(c) for c in dL] if v: print dL bL = list() for b in dL: n = 8 - len(b) b = n*'0' + b[2:] bL.append(b) if v: print bL b = ''.join(bL) bL = [b[:8], b[8:16], b[16:]] if v: print bL iL = [int('0b' + b,2) for b in bL] for i in iL: print hex(i), print for i in iL: print i, print '\n' ```

## Monday, August 29, 2011

### Dissecting RSA keys in Python

I want to look at RSA keys a little bit more in this post. Let's generate a public/private pair (1024 bits) using the SSH utility keygen (no passphrase):

 ```> ssh-keygen -b 1024 -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/Users/telliott_admin/.ssh/id_rsa): id_rsa Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in id_rsa. Your public key has been saved in id_rsa.pub. The key fingerprint is: 96:49:5d:40:b1:90:de:99:c5:05:43:f2:92:45:69:67 telliott_admin@c-98-236-78-154.hsd1.wv.comcast.net The key's randomart image is: +--[ RSA 1024]----+ | .o=*B+. | | .o B=.E | | ...==.o | | ..o+. | | S | | . | | | | | | | +-----------------+```

[ To be honest, at this point, I'm not actually sure this is the randomart that goes with the example in this post.. but I think it could be :) ]

Start with the "public" key. The data in the file looks like this:

`id_rsa.pub`
 `ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosxpbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvFV4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5jvoJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8= my_identifier`

Where my_identifier is

`telliott_admin@c-myipaddress_with_dashes.hsd1.wv.comcast.net`

The common representation of RSA keys (like we see in the file above) uses base64 format. I was confused about this at first, but found a great explanation here, and of course in wikipedia:

The base64, base32, and base16 encodings convert 8 bit bytes to values with 6, 5, or 4 bits of useful data per byte, allowing non-ASCII bytes to be encoded as ASCII characters for transmission over protocols that require plain ASCII, such as SMTP. The base values correspond to the length of the alphabet used in each encoding. There are also URL-safe variations of the original encodings that use slightly different results.

Consider the byte `1001-1110`

We can encode this byte using hexadecimal representation as `'9e'`

 ```>>> b = 0b10011110 >>> b 158 >>> hex(b) '0x9e'```

I didn't actually know you could do this, i.e. using 0b with no quotes (see SO).

The encoding of those 8 bits (one byte) as ASCII characters takes 2 bytes. The '9' and the 'e' take up that much space. An advantage of this representation of the byte is that it's easy to read, also the information can be transmitted to devices that are expecting an ASCII string, where the highest-value bit should be unset.

A disadvantage is that it doubles the size of the string that we're transmitting. As the quote indicates, the other encodings cram more bits of information into the byte to be transmitted. In base32 we could use (RFC4648) the characters 'A'..'Z' plus '2'..'7'.

In base64 we use uppercase, lowercase, digits plus '+' and '/', except that:

Using standard Base64 in URL requires encoding of '+', '/' and '=' characters into special percent-encoded hexadecimal sequences ('+' = '%2B', '/' = '%2F' and '=' = '%3D'), which makes the string unnecessarily longer.
For this reason, a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_',

The wikipedia article has a great graphic showing an encoding in detail. We do conversions easily by using the Python base64 module.

 ```>>> import base64 >>> e = base64.b64encode >>> d = base64.b64decode >>> e('Man') 'TWFu' >>> d('TWFu') 'Man' >>> s = 'Fourscore, and seven years ago' >>> len(s) 30 >>> c = e(s) >>> c 'Rm91cnNjb3JlLCBhbmQgc2V2ZW4geWVhcnMgYWdv' >>> d(c) 'Fourscore, and seven years ago'```

The encoding process looks at each sequence of 24 bits in the input (3 bytes) and encodes those same 24 bits spread over 4 bytes in the output. The last two characters [in the next example], the ==, are padding because the number of bits in the original string was not evenly divisible by 24 in this example.

 ```>>> e('Ma') 'TWE='```

We break this two byte (16 bit) value up into 6 + 6 + 6 = 18 bits---that makes three base64 values, and the third one is now E rather than F because the bit pattern is 000100, where the last two 00 are padding. Additional adding is added to bring us back to the byte boundary.

In Lincoln's famous phrase, there are 30 bytes, which is evenly divisible by 6, hence no padding is needed.

The second part of the problem was solved by a question I found on SO

Here is my version of the code:

 ```import base64 import struct data = open('id_rsa.pub').read().split(None)[1] print data[:24] keydata = base64.b64decode(data) print keydata[:18] parts = [] while keydata: dlen = struct.unpack('>I',keydata[:4])[0] data, keydata = keydata[4:dlen+4], keydata[4+dlen:] parts.append(data) def f(x): return struct.unpack('B', x)[0] e = eval('0x' + ''.join(['%02X' % f(x) for x in parts[1]])) print 'e', e n = eval('0x' + ''.join(['%02X' % f(x) for x in parts[2]])) print 'n', n```

I have to confess I haven't sorted this out completely yet. Nevertheless, it allows us to convert the RSA public key from ssh-keygen to an integer:

Output:

 ```> python script.py AAAAB3NzaC1yc2EAAAABIwAA ssh-rsa# e 35 n 146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351```

I want to finish by showing that these values are actually correct, and tie the whole thing back to an earlier discussion of the mathematics of public key cryptography.

I found a module for Python that handles RSA cryptography.

One very nice thing about it is that it's pure Python. It might not be fast, but there's nothing to do to build it. (I used easy_install). It turns out that the rsa module can read the private RSA key that we got earlier (top of the post), but not the public key (which is why I learned everything that's come before this point):

 ```> python Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import rsa >>> with open('id_rsa') as f: data = f.read() ... >>> data.split()[4][:50] 'MIICWgIBAAKBgQDRCCcb4sFCizGluQkhTR5ZX1l/svrd/LcFen' >>> >>> k = rsa.PrivateKey.load_pkcs1(data) >>> k.n 146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351L >>> k.e 35 >>> k.d 71296617968088267948638260991671705244575086516882194071102095926754516318074629294696149759283397758157177175856544613145845391211758861147798865874304045384903321200809536161346504709634159813307741235395231601606901597492591281205810328850837168587984349087962868939575631849929827946396907884195208774355L >>> k.p 12214944519618901215070692893221534580187337690484280451921546594832793966376942655508370083135703324858458340368238021755353370324360760471666368429071147L >>> k.q 12017013618393528112233953119917677248859562342149056793052943111068006176296374093622600767249370271694572460567324457033873003984957249202580919628769533L >>> k.n == k.p * k.q True >>>```

You can see that n equals p * q and that it matches what we eventually obtained from the public key file (`id_rsa.pub`) above. That's pretty cool.

Still to do: use the keys to do encryption and decryption and check that the math is right. Figure out how the hash/digest is done. And figure out how the passphrase is used to modify the private key file. We can even add a passphrase at this point to the old version:

`ssh-keygen -f id_rsa -p`

I used 'abcde' as a passphrase (weak!):

 ```> cat id_rsa -----BEGIN RSA PRIVATE KEY----- Proc-Type: 4,ENCRYPTED DEK-Info: AES-128-CBC,FB4D3BD371510C4B37552EF6E949CC48 E+KtbKVAwE4Dxj3vKYgfHpbgqF9Nk6O4vJ+C6wogXEHXr8vkM9SF5pz6lKNfGRbC S93/lurvq+1+rCuldDNmEjdy36cbgrWJ1CIwB6fq5VyT3+v+AxDMqV9nQkvHjnxR /ON2n+CjOSklnvmaBk3l8+IvCIKar2hGS/1qHqJRKJ3B1fgd5I8i9yH6hx8yl8Qm AfVPgBJ4+91dCW2Y8odtXENgjstGtWZkYd0O+EqPuSuPHLu+J7M2KSQNe8aKOhGf DXCEwR4aGW8EXfRNQpULbi5JruYebYlg3JILm2MaFwy+OW39q1VElvBc5wD01awW BF7Y9x++bi9krX/G49NzkLdSVshkDMSP+eRZ4sncWvDCrkN2xlWqnNfQ06I+XCVW FMUmyMFI87eNmjcZOgIi1S+fQtWcbl4o9Hx1z6xcMd7WxOnfIxe5zHIu8C24WONW spcVxSsHAA57jYa8j292prtK6HDDa6UOAa32hBz1lvC8fMWPxwluTJ8eyeeK/7O6 hNI5G9gM4MsMEMau8O49fEm0RuiT/aB8IXZeEVQNbV/FdixyBBhY53DMnFQ5wau8 OJCpL51S5lj2o8eAJc61SHwX/Fc9uF+rmAFy08tthLuCF2wGdZWsE7IJ8NQzFJA+ Us2dTCjY57V/QoCC41rWh+YP+srcEK1Wp7DfvXcq98N1/+tECyCe/aDdPCIM/8QD glKrx8Ah7Tsx6of3vCHiYG0q8Yu68NNJOiWrnGMEmP0Gq9yhLYqKrc8a3oIP0Lst Ss+0p989df73GdM3nyqtLUwAzaCJQnDivLFT6gltEnM= -----END RSA PRIVATE KEY-----```

## Sunday, August 28, 2011

### Duly quoted

Recycled from an earlier post:

Programs must be written for people to read, and only incidentally for machines to execute.

- Abelson & Sussman, SICP

How do we convince people that in programming simplicity and clarity—in short: what mathematicians call "elegance"— are not a dispensable luxury, but a crucial matter that decides between success and failure?

- E. W. Dijkstra

### Linux Server SSH

To continue with my Ubuntu Linux server odyssey, I got SSH working this afternoon for access from my LAN. As always, heed the warnings, these posts are without a warranty of any kind, not least because I am a total amateur. They exist because it's easy to have Google to "index my brain" or rather, what was in my brain at a certain time.

The objective is to enable remote login for my Ubuntu machine running with a local ip address of 10.0.1.2. We begin by setting a stronger password for my account on Ubuntu under: System > Preferences > About Me (weirdly, it's not under Passwords and Encryption Keys ..).

On the client (OS X Snow Leopard Server), I already have some key pairs from a previous test

 ```> ls ~/.ssh authorized_keys id_dsa id_dsa.pub known_hosts```

but they are DSA keys, which the notes I'm going to follow deprecate. We'll make a short 1024 byte bit RSA pair for now:

 ```ssh-keygen -b 1024 -t rsa > cat id_rsa.pub ssh-rsa AAAAB..5Eoec= telliott_@c-98-___-__-154.hsd1.wv.comcast.net```

Not sure what the pseudo ip address or "comcast" thing is about.

I also take this opportunity to set a passphrase for the public key (and of course I make a note of it). This will be useful (it says) because we can set up the server to require both a password to gain access to the public key, as well as the corresponding private key before login is allowed.

[UPDATE: I'm still working my way through this, but I suspect that the previous statement isn't correct. The passphrase is used to protect the value of the private key. So, in this setup OS X will decrypt the private key using the passphrase, and the stored value of the private key in `id_rsa` is encrypted. I'll try to figure all this out soon. ]

The easiest way to get my new public key onto the server is to use SSH (with password, before we disable it). The next question is, do we need to install an SSH server? It appears yes.

 `sudo apt-get install openssh-server`

There is a pre-existing config file `/etc/ssh/ssh_config` but after the previous command there are more including `sshd_config`. I make sure to save a copy of this file before I modify it. Now try:

 ```ssh te@10.0.1.2 ssh: connect to host 10.0.1.2 port 22: Connection refused```

OK, so we need to modify `/etc/ssh/sshd_config`. `Port22` is already uncommented. Now uncomment:

 ```PermitRootLogin no ChallengeResponseAuthentication yes PasswordAuthentication yes # we'll set it to no eventually```

On the server:

 `sudo /etc/init.d/ssh restart`

On the client:

 ```> ssh te@10.0.1.2 The authenticity of host '10.0.1.2 (10.0.1.2)' can't be established. RSA key fingerprint is d1: .. :83. Are you sure you want to continue connecting (yes/no)? ```

Check the fingerprint on the the server:

 ```ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub 2048 d1: .. :83 /etc/ssh/ssh_host_rsa_key.pub (RSA) Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '10.0.1.2' (RSA) to the list of known hosts. Connection closed by 10.0.1.2 > ssh te@10.0.1.2 te@10.0.1.2's password: Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-11-generic x86_64)```

Now to change to not using our password. We need to copy our RSA public key to the server in some secure way. According to the sshd_config file, the "authorized keys file" is
%h/.ssh/authorized_keys. That's in my home directory. I do:

 ```> scp ~/.ssh/id_rsa.pub te@10.0.1.2:~/.ssh/authorized_keys te@10.0.1.2's password: id_rsa.pub 100% 260 0.3KB/s 00:00 te@VB:~\$ logout Connection to 10.0.1.2 closed. > ssh te@10.0.1.2 Identity added: /Users/telliott_admin/.ssh/id_rsa (/Users/telliott_admin/.ssh/id_rsa) Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-11-generic x86_64) * Documentation: https://help.ubuntu.com/ Last login: Sun Aug 28 14:35:49 2011 from osxserver.local te@VB:~\$ ```

Now, finally, be sure to turn off password authentication for the ssh server: `PasswordAuthentication no`. At first, I'm not being prompted for a password to retrieve my public key on the server.. just getting automatically logged in by:

 `ssh te@10.0.1.2`

I forgot to do:

 ```ChallengeResponsePasswords yes sudo /etc/init.d/ssh restart```

Restart and I get the challenge.. Although it is apparently cached sometimes. Looks like it works.

### Linux Server

A couple of weeks ago I described installation of Ubuntu Linux (11.04), running under Virtual Box on Mac OS X Lion. I got matplotlib (and scipy) installed and running without any difficulty using `apt-get` alone.

Next, I installed Apache Server 2.2 and configured everything to run Python scripts by using Firefox from Ubuntu on the VM like this:

`localhost/cgi-bin/test.py`

After struggling a bit, I figured out how to use Virtual Box in NAT mode to do "port forwarding"---by modifying

`/etc/apache2/sites-available/default`

so then I could launch the script from the host OS by doing

`localhost:8080/cgi-bin/test.py`

from Safari or by using curl in Terminal.

The next step was to try to access the server from a second machine on my home network. In order to do that, (as I understand it) I can either remain in NAT mode and try to connect via the host machine and port-forwarding, or use "bridging" mode to effectively put the Ubuntu server on the network.

In order to accomplish this, I need to know the ip address of the server as assigned by my Apple Time Capsule router. Now the problem is that this ip address may change with time (since we're getting it from the Time Capsule as a DHCP server). According to the folks I asked on serverfault, I need to switch the Ubuntu network IPv4 settings to manual and ask for a specific ip address and this is correct so far as it goes.

However, the address I ask for in Ubuntu

has to match the range of addresses that the router is set up to hand out. Furthermore, we probably want to have most ip addresses handled dynamically, and only use a static ip for the one machine. If we wanted them all static, we'd go here:

I found the answer here but the setting is actually visible in this screenshot:

Under Internet > DHCP there is a TableView called DHCP reservations. I just reserved the first available ip address. After re-booting Ubuntu and restarting the server, I can run the test script or do:

Now I have to figure out how to access the server from the outside world. I should probably remove the test script from `cgi-bin` (since it reveals information about paths on the server) and read up on firewall configuration, etc. before I do that. The first problem for this part (not yet solved) is that my ip address from my isp, obtained with

`checkip.dyndns.org/`

does not seem to be accessible. I get a "Safari can't connect to the server .." message.

## Saturday, August 27, 2011

### Misunderstanding Euclid

Category: things you thought you knew that are wrong
Today's entry is from mathoverflow:

Many students believe that 1 plus the product of the first n primes is always a prime number. They have misunderstood the contradiction in Euclid's proof that there are infinitely many primes. (By the way, 2 * 3 * 5 * 7 * 11 * 13 + 1 is not prime.)

``` from math import sqrt def f(N): N += 1 print 'N', N n = int(sqrt(N)) for i in range(2, n+1): if not N % i: print i, N/i >>> N = 2*3*5*7*11*13 >>> f(N) N 30031 59 509 >>> N *= 17 >>> f(N) N 510511 19 26869 97 5263 277 1843 >>> N *= 19 >>> f(N) N 9699691 347 27953 ```
What the theorem actually says (wikipedia):

Take any finite list of prime numbers p1, p2, ..., pn. It will be shown that at least one additional prime number not in this list exists. Let P be the product of all the prime numbers in the list: P = p1p2...pn. Let q = P + 1.

Then, q is either prime or not:

If q is prime then there is at least one more prime than is listed.

If q is not prime then some prime factor p divides q. This factor p is not on our list: if it were, then it would divide P (since P is the product of every number on the list); but as we know, p divides P + 1 = q. Then p would have to divide the difference of the two numbers, which is (P + 1) − P or just 1. But no prime number divides 1 so there would be a contradiction, and therefore p cannot be on the list. This means at least one more prime number exists beyond those in the list.

### Late nights on the web

This is just a short post to document some interesting links. I recently posted about getting matplotlib installed with minimum fuss on OS X Lion, and this seems to have helped quite a few folks.

The person who showed me how to do this is named Rui Carmo, and I finally found the time to begin browsing through his extensive site; he has lots of good stuff about Macs and other technology (like here).

Perhaps the most interesting thing I found is what appears to be a very high quality company called Linode, offering Linux Virtual Servers in the cloud, their docs include an extensive library related to the topic. I'm thinking seriously about trying it (it only costs just a bit more than the NY Times).

And the final result (of browsing the web on a sleepless night) is to note the profusion of different sites connected to Stack Overflow, including serverfault and superuser. Very rewarding to browse through Q & A, and of course, when I have a question of my own.

One more thing: a pdf form-filler link that is not Adobe. One of the sites I use for work has Adobe forms on it that I can only read through Safari if I have Reader installed as the default pdf viewer. Maybe it will help with this, since I loathe Adobe.

## Wednesday, August 24, 2011

### Pretty code (6)

This is absolutely the last post about html and code stuff. The whole adventure started here. It's a bit amusing to realize that if I had just understood what <br> tags were about (and <span> and <div>), this would never have happened.

I decided to write my own parser for Python code, so that I could post the results here. On the way, I learned something about "styles" and stylesheets. To organize things, I put this code for a "style" in the template for the blog at the end of <head>:

```<style type='text/css'>
div.content
{
font-size:12pt;
font-family:arial;
color: #404040;
}

div.code
{
font-size:12pt;
font-family:courier;
margin: 20px;
border-style: solid;
border-width: 1px;
border-color:#ff00ff;
}

span.kw   { color:blue; }
span.str  { color:red; }
span.str3 { color:purple; }
span.py   { color:#00b0b0; }
span.cm   { color:green; }
</style>```

The parser output looks great when run on itself. The first little bit is here:

But, alas, it was not to be. The Blogger editor is now (still) messing with my code, removing all the <br /> that I carefully put in, so that the informative whitespace is gone in the actual post (though not in the intial preview before posting). It looks hopeless at this point.

Files for the parser on Dropbox (here). The only known bugs are those visible in the comment lines above, where the strings should have been masked by being inside a comment.

Unfortunately, I do not know how to make this all work for Blogger.

What a waste! I am not happy with these guys. The only way I can see is to have my own server.

## Monday, August 22, 2011

### Pretty code (5)

This is the first section of the same example from the last post (here, more explanation here), run through Pygments. It is self-contained and doesn't require loading any other resources.

The code to generate it was:
```> pygmentize -f html -O full -o test.html pyparser.py
```

This is the same parser that is used for the official Python docs, which are generated in combination with Sphinx. I have only a hazy idea of how that's done.

It's easily customizable if you want.

```import sys
from keyword import kwlist
from string import punctuation as punct
from utils import load_data
import html_tags as H
pywords = ['True','False','list','dict',
'int','float','append','extend',
'sys','argv','pop','open','write',
'close','string','time']

try:
fn = sys.argv[1]
except IndexError:
fn = 'example.py'
```

### Pretty code (4)

Explanation of what this is about from last time here. I got triple-quoted strings working, and I'm really pleased with the results. More another time on other options for code-highlighting.
```import sysfrom keyword import kwlistfrom string import punctuation as punctfrom utils import load_dataimport html_tags as Hpywords = ['True','False','list','dict',           'int','float','append','extend',           'sys','argv','pop','open','write',           'close','string','time']try:      fn = sys.argv[1]except IndexError:      fn = 'example.py'data = list(load_data(fn))D = {'is_cm':False,     'in_str_1':False,     'in_str_2':False,     'in_str_3':False }L = list()while data:    c = data.pop(0)    # comments first    if c == '#':        if not (D['in_str_1'] or D['in_str_2']):            L.extend(list(H.cm_start))            D['is_cm'] = True    if c == "\n" and D['is_cm']:        L.extend(list(H.cm_stop))        D['is_cm'] = False    L.append(c)        a = '''triple-quoted-string    with a continuation and two keywords'''        # triple-quoted strings    if c == "'" and len(data) > 1:        if data[0] == "'" and data[1] == "'":            if (not D['in_str_1'] and not D['in_str_2']):                if not D['in_str_3']:                    L.pop()                    L.extend(list(H.str3_start))                    L.extend(["'"] * 3)                    data.pop(0)                    data.pop(0)                    D['in_str_3'] = True                else:                    data.pop(0)                    data.pop(0)                    L.extend(["'"] * 2)                    L.extend(list(H.str3_stop))                    D['in_str_3'] = False                    continue        # single-quoted strings    if c == "'":        if D['in_str_3']:            continue                if (not D['in_str_1']):            if not D['in_str_2']:                # start a str_1                L.pop()                L.extend(list(H.str_start))                L.append(c)                D['in_str_1'] = True            else:                # already in str_2 or str_3                pass        else:            if D['in_str_1']:                # terminate str_1                    L.extend(list(H.str_stop))                D['in_str_1'] = False                # double-quoted strings    if c == '"':        if not D['in_str_2'] and not D['in_str_3']:            if not D['in_str_1']:                # start a str_2                L.pop()                L.extend(list(H.str_start))                L.append(c)                D['in_str_2'] = True            else:                # already in str_1 or str_3                pass        else:            if D['in_str_2']:                # terminate str_2                    L.extend(list(H.str_stop))                D['in_str_2'] = Falses = ''.join(L)# keywords lastpL = list()D['is_str_3'] = Falsefor line in s.split('\n'):    D['is_cm'] = False    words = line.split()    for w in words:        # no kw highlighting in comments        if w.startswith(H.cm_start):            D['is_cm'] = True        if w.startswith(H.str3_start):            D['is_str_3'] = True        if H.str3_stop in w:            D['is_str_3'] = False        if not D['is_cm'] and not D['is_str_3']:            if w in kwlist:                r = H.kw_start + w + H.kw_stop                line = line.replace(w, r)            for p in pywords:                if p in w:                    L = w.split(p)                    if L[0] and not L[0][-1] in punct:                        continue                    if L[1] and not L[1][0] in punct:                        continue                    r = H.py_start + p + H.py_stop                    line = line.replace(p, r)    pL.append(line)    s = H.br.join(pL)pL = [H.head, H.hr, s, H.hr, H.tail]s = '\n'.join(pL)fn = fn.split('.')[0] + '.html'FH = open(fn,'w')FH.write(s + '\n')FH.close()

```

Here's a screenshot of `html_tags.py`:

### Pretty code (3)

I worked a bit more on the Python parser that I described last time (here). Actually, I started over from scratch. The first approach was based on breaking the code into words for processing, but I realized that a "stream" approach of one character at a time is better. The only limitation I'm aware of is that I still don't handle triple-quoted strings, but I think it's doable in the future.

As I mentioned before, I'm using <br /> tags in the code now rather than newlines.

The instructions for doing this said to modify the blog template, but I'm afraid of breaking the formatting for old posts. I put the following just below the <body> tag instead, as shown. It seems to work.
```<body>

<style>
cd { font-size:120%; }
cm { color: green }
kw { color: blue; }
str { color: red; }
</style>

```

Then, to be totally meta about it, I ran the parser on itself. It is set up to grab HTML tags from a separate file (otherwise the meta application chokes on the tags). They are the ones shown above. Plus, there is a head and a tail to make an independent .html document. But for pasting into the blog, you don't need those guys. I invoked it like this:

```python simple_parser2.py simple_parser2.py > example.html
```

And here is the code:

```
import sysfrom keyword import iskeywordfrom utils import load_dataimport html_tags as Htry:      fn = sys.argv[1]except IndexError:      fn = 'example.py'data = list(load_data(fn))D = {'is_cm':False,     'is_str_1':False,'is_str_2':False }L = list()for c in data:    # comments first    if c == '#':        if not (D['is_str_1'] or D['is_str_2']):            L.extend(list(H.cm_start))            D['is_cm'] = True    if c == "\n" and D['is_cm']:        L.extend(list(H.cm_stop))        D['is_cm'] = False    L.append(c)        # single-quoted strings    if c == "'":        if not D['is_str_1']:            if not D['is_str_2']:                # start a str_1                L.pop()                L.extend(list(H.str_start))                L.append(c)                D['is_str_1'] = True            else:                # already in str_2                pass        else:            # terminate str_1                L.extend(list(H.str_stop))            D['is_str_1'] = False    # double-quoted strings    if c == '"':        if not D['is_str_2']:            if not D['is_str_1']:                # start a str_2                L.pop()                L.extend(list(H.str_start))                L.append(c)                D['is_str_2'] = True            else:                # already in str_1                pass        else:            # terminate str_2                L.extend(list(H.str_stop))            D['is_str_2'] = Falses = ''.join(L)# keywords lastpL = list()for line in s.split('\n'):    D['is_cm'] = False    words = line.split()    for w in words:        # no kw highlighting in comments        if w.startswith(H.cm_start):            D['is_cm'] = True        if not D['is_cm'] and iskeyword(w):            r = H.kw_start + w + H.kw_stop            line = line.replace(w, r)    pL.append(line)s = H.br.join(pL)pL = [H.head, H.hr, s, H.hr, H.tail]s = '\n'.join(pL)try:      fn = sys.argv[2]except IndexError:      fn = 'example.html'FH = open(fn,'w')FH.write(s + '\n')FH.close()

```

## Sunday, August 21, 2011

### Pretty code (2)

So the question is, what are the good ways to format code for blog posts, when you don't control the machine that serves the pages (like on Blogger). The first---and motivating---post in this series is here.

I always prefer simple to complex. I even prefer simple html pages, although I'll ignore the obvious problem that pages are not simple any more when Blogger gets through with them.

So I liked the idea of controlling the color of text in a page with self-defined tags. For example, we can put this in the head of our html document:

It's a seductively simple idea. But.. it requires a way of parsing Python code to generate the appropriate tags. I thought if I enforced some rules that could be done relatively easily.

After a few hours, I have something that sort of works, but it has too many bugs to be useful yet.

The main simplifying rule is that multi-line strings are not allowed. Also, at this point I assume that indentation is a four-space tab.

You can see the results in the graphic (above). I put the script on Dropbox (simple_parser.py). Without a lot more debugging, this seems like it's probably not the best option, but it was an entertaining exercise for the afternoon.

### Formatting problem solved

[ UPDATE: So I discovered that the line breaks setting below seems to be retroactive, and changing it messes up the old posts. Another solution is to substitute newlines in the code with <br /> tags. That's what I've done here, and seems OK. This makes it difficult to actually edit code samples in the editor, but them's the breaks. :) ]

On the topic of how to format code samples for the blog, I've been doing some testing and found that the combination of
``` Settings > Formatting section: Convert line breaks No ```
and
``` Settings > Basic section: Global > Select post editor Updated editor ```
appears to have solved the problems that cropped up (my post here) with my standard approach using <table>. A sample:

 ```def f(n):print "Hello world!" ```

And I like it that the new editor's preview mode gives an accurate view of the product.

However, there are other solutions to consider, including how to get syntax highlighting without needing to load resources that are out of my control (my SO question is here), or whether to use a code hosting service and "embed" the code here. I'll have to get back to you on that.

## Saturday, August 20, 2011

### Bioonductor: summary

[ Update: Sorry for the dup. I tried using the new formatting on this one, but found a problem, which I hope is fixed now.]

I promise this is the last one. Here we load the sample data from ExpressionSet. I show (once again) how to obtain all the relevant stuff: expression data, sample names, feature names, and gene names.

One of the nice things about ExpressionSet objects is the very fancy indexing one can do with them. A great example is given on p. 166 of the Bioconductor book.

I made an stab at implementing range selectors here but it's almost trivial. Also, in this one, the Python code does not contain any informative whitespace, so I've used the <code> tag.
The script is first and then the output. If you can't figure something out, just ask.

`script.py`
 ```import rpy2.robjects as robjects from rpy2.robjects.packages import importr import rpy2.rinterface as ri import bioc.biobase as biobase r = robjects.r g = robjects.globalenv #gd = importr('grDevices') dim = r['dim'] r('data("sample.ExpressionSet")') sample_ExpressionSet = r['sample.ExpressionSet'] eset = sample_ExpressionSet #print tuple(ri.globalenv) print eset print dim(eset) ed = eset.exprs print ed.rx(1,1) # no way to do es[x,y] from Python? # just wrap it r(''' do_sel <-function(obj,sel,bycol=FALSE) { if (bycol) { return(obj[,sel]) } else { return(obj[sel,]) } } ''') do_sel = r['do_sel'] # both work: v = robjects.IntVector([2,3,6]) v = robjects.BoolVector([True,False]*13) eset_sub = do_sel(eset,v,bycol=True) eset_sub = do_sel(eset_sub,v) print dim(eset_sub) # no explicit featureData, just from data rows rownames = r['rownames'] fd = rownames(ed) print fd[:2] # get the gene names importr('annaffy') r(''' get.genenames <- function(probeids) { symbols <- aafSymbol(probeids, 'hgu95av2') getText(symbols) } ''') get_genenames = r['get.genenames'] gn = get_genenames(fd) # slightly fancy, no dups, original order from collections import OrderedDict z = zip(gn,[None]*len(gn)) D = OrderedDict(z) L = D.keys() for i in range(4): print L[i*5:i*5+5] print pd = eset.do_slot('phenoData') data = pd.do_slot('data') v = robjects.IntVector(range(1,4)) print do_sel(data,v) ```
```> python script.py
ExpressionSet (storageMode: lockedEnvironment)
assayData: 500 features, 26 samples
element names: exprs, se.exprs
protocolData: none
phenoData
sampleNames: A B ... Z (26 total)
varLabels: sex type score
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu95av2

Features  Samples
500       26

[1] 192.742

Features  Samples
3        3

[1] "AFFX-MurIL2_at"  "AFFX-MurIL10_at"

******* Deprecation warning *******:

The package 'hgu95av2' is deprecated and will not be supported in the future.

Instead we strongly reccomend that you should start using the 'hgu95av2.db' package.

We hope you will enjoy these new packages.  If you still have questions, you can search
(http://dir.gmane.org/gmane.science.biology.informatics.conductor) or ask the mailing list
(http://bioconductor.org/docs/mailList.html).

['', 'STAT1', 'GAPDH', 'ACTB', 'TFRC']
['C15orf31', 'ACTL6B', 'GLRA1', 'KCNB2', 'MGAT5']
['BMP3', 'RPL14', 'ZCWPW2', 'KITLG', 'GPR12']
['TRA@', 'ATP8A2', 'C1orf68', 'FAM20B', 'SLC34A1']

sex    type score
A Female Control  0.75
B   Male    Case  0.40
C   Male Control  0.73

>
```

## Friday, August 19, 2011

### Bioconductor: multiple plots

Actually, I have a bit more to do with Bioconductor ExpressionSets, as long as I'm thinking about it. When I first played with the ALL example, I tried other pairs besides ('ALL1/AF4','BCR/ABL')---I'd have to look now, but they are probably in the original paper too. Anyway, in the first talk I gave about this stuff, I made plots of all the pairs. Some are really nice. So that's what I'd like to show now.

The difference is that then, I wrote the code in R, and I did not like the experience.

In Python, we do the standard imports. We factor out the R code into several functions, then load it from a text file in Python, and execute the code in R-space. In Python, we pick pairs to compare and call a plotting function in R.

The R code is a little hackish (I'm thinking of color.map), but this is a good start.

R code:
 ```color.map <- function(g) { colors = c('red','green','blue', 'cyan','magenta','maroon') if (g == 'ALL1/AF4') i=1 if (g == 'BCR/ABL') i=2 if (g == 'E2A/PBX1') i=3 if (g == 'NEG') i=4 if (g == 'NUP-98') i=5 if (g == 'p15/p16') i=6 colors[i] } library('limma') library('annaffy') get.genenames <- function(eset) { probeids <- featureNames(eset) symbols <- aafSymbol(probeids, 'hgu95av2') getText(symbols) } compare.factors <- function(eset,L) { eset.sub = eset[,eset\$mol.biol %in% L] f <- factor(as.character(eset.sub\$mol.biol)) design <- model.matrix(~f) fit <- eBayes(lmFit(eset.sub,design)) pv = fit\$p.value[,2] sel <- p.adjust(pv) < 0.001 eset.sel <- eset.sub[sel,] } plot.eset <-function(eset,...) { patient.colors <- unlist(lapply( eset\$mol.biol,color.map)) genenames = get.genenames(eset) heatmap(exprs(eset), col=topo.colors(100), ColSideColors=patient.colors, labRow=genenames) } ```
The R code is in a file `Rcode.txt`. Python code:
 ```import rpy2.robjects as robjects from rpy2.robjects.packages import importr import bioc.biobase as biobase r = robjects.r g = robjects.globalenv gd = importr('grDevices') importr('ALL') r('data("ALL")') ALL = g['ALL'] eset = biobase.ExpressionSet(ALL) #-------------------------------------------- FH = open('Rcode.txt','r') s = FH.read() FH.close() r(s) compare_factors = r['compare.factors'] plot_eset = r['plot.eset'] #-------------------------------------------- targets = [ ['ALL1/AF4','NEG'], ['ALL1/AF4','BCR/ABL'], ['E2A/PBX1','BCR/ABL'] ] def get_ofn(t): #print t, len(t) u,v = t s = u + '_' + v ofn = s.replace('/','-') + '.pdf' return ofn def plot(eset_sub,t): ofn = get_ofn(t) gd.pdf(ofn) plot_eset(eset_sub,ofn) gd.dev_off() for t in targets: eset_sub = compare_factors(eset,t) plot(eset_sub,t) ```

The first plot is at the top, and the second and third are below. It's pretty clear that there are significant commonalities according to mol.biol. BTW, I adjusted the target p-value to keep the number of hits down.

I also added code to substitute the gene names in the plots. That's pretty cool.

There are problems with the alternatives. That's for another time.