Python for Bioinformatics: 2011

Wednesday, December 14, 2011

Getting a little help from Newton

I've posted here on a number of topics and derivations in fundamental calculus. This is about finding the derivative for powers of x with negative or fractional exponents. It seems a trivial proof, until I got stopped in my tracks by the cofactors of the binomial distribution.

The idea is to find the slope of the curve at x by adding a little bit to x, (x + h), calculating f(x + h), then subtract f(x), and finally divide by h and finding the limit as h gets very small. It works great for y = x². But consider

y = x^1/2

We want to do:

1/h [(x + h)^1/2 - x^1/2]

So thinking about the binomial expansion of the x + h term, I'm trying to figure what is the expansion for the 1/2 power? I mean, what can "n choose k" mean when n is 1/2?

The other thing that is a good approach here is to bring x out of the sum as follows

(x + h)^1/2
= x^1/2(1 + h/x)^1/2

To expand this, we just remember Newton's super-duper binomial expansion from here. If the expression is

(1 + Q)^m/n

the expansion is

1 + (m/n)Q + (m/n)((m-n)/2n)Q² + (m/n)((m-n)/2n)((m-2n)/3n)Q³ + ..

We only need the first two terms because Q² is (h/x)² which we will ignore. So we have for the expansion

(1 + h/x)^1/2
= 1 + 1/2(h/x)

and the derivative is the limit as h => 0 of

1/h [x^1/2(1 + 1/2(h/x)) - x^1/2]
= 1/h (1/2 h) 1/x^1/2
= 1/2 x^-1/2

Similarly for m = -2 and n = 1 the first two terms are 1 and -2 and the expansion we need will be

(1 + (h/x))^-2 = 
= 1 - 2(h/x)

The derivative is the limit as h => 0 of

1/h [x^-2 (1 - 2(h/x)) - x^-2] =
= 1/h x^-2 (-2)(h/x)
= -2 x^-3

Easy when you know how.

Basic Python for Bioinformatics

I've been silent online for the last few months, due to other commitments associated with my day job. But I haven't forgotten all of you in the ether.

Just before I got busy, I wrote a book about basic programming with Python using lots of biological examples. It's been proofed once or twice, but is not quite a beta version, so let's call it the gamma release. I want to push it out now, even it could certainly use some more polish, because it's not clear when I'll get the time to do that.

The target audience is the novice programmer who is interested in biology. However, it differs from most similar material because there are lots of biological and sequence analysis examples, collected from various sources including posts on this site.

Here is a screenshot of the toc for the middle section of the book.

The book is in html format (link to Dropbox zip). It was built using the Sphinx software. One can generate a pdf version of the book as well, but it suffers from the major limitation that cut-and-paste of the code loses the indentation. The version linked to here has code examples that can be pasted directly into a text file and executed, and often will work in the interpreter as well.

I hope you like it. I would be very grateful for comments, criticisms, corrections, and even complaints. Post 'em as comments here, or find my work email through the aboutme. Enjoy!

Wednesday, August 31, 2011

Binary multiplication, again

I had a lot of fun this afternoon, but all along I knew I was "reinventing the wheel." It's easy when you know where you're going.

I implemented binary multiplication with a class that keeps its data as Python strings like '00100011'. It's probably because I have been re-reading Charles Petzold's book Code, which I think is a classic. In any event, I wrote the class "word" which allows objects to add, multiply and exponentiate themselves. The second file is test.py which exercises the class.

It's a great "simple Python" coding project. We do bit-shifting by hand. If you want to be slick, you could go further by trying to implement fast exponentiation. Here is another example by a guy who obviously knows what he's doing.

A bit of sample output:

> python test.py
x     467
y     327
x + y 00000000 00000000 00000011 00011010 = 794
check 467 + 327 = 794
x * y 00000000 00000010 01010100 10000101 = 152709
check 467 * 327 = 152709
z      3
x**z  00000110 00010010 00010010 00001011 = 101847563
check 467 ** 3 = 101847563

x     995
y     402
x + y 00000000 00000000 00000101 01110101 = 1397
check 995 + 402 = 1397
x * y 00000000 00000110 00011010 01110110 = 399990
check 995 * 402 = 399990
z      3
x**z  00111010 10110111 00001100 10111011 = 985074875
check 995 ** 3 = 985074875

x     897
y     490
x + y 00000000 00000000 00000101 01101011 = 1387
check 897 + 490 = 1387
x * y 00000000 00000110 10110100 11101010 = 439530
check 897 * 490 = 439530
z      4
x**z  10111011 11001001 10001110 00000001 = 3150548481
ran into a slight problem

word.py

class word:
    SZ = 32
    def __init__(self,n):
        if type(n) == type('a'):
            b = n.rjust(word.SZ,'0')
        elif type(n) == type(1):
            b = bin(n)[2:]
        else:
            raise ValueError, "can't do that"
        self.b = b.rjust(word.SZ,'0')
        
    def __repr__(self):
        b = self.b
        s = ' '.join([b[:8], b[8:16], b[16:24], b[24:]])
        s += ' = ' + str(eval('0b' + self.b))
        return s
        
    def __add__(self, a):
        carry = '0'
        rL = list()
        for i in range(len(self.b)-1,-1,-1):
            x = self.b[i]
            y = a.b[i]
            t = x+y
            if (t == '01' or t == '10'):
                if carry == '0':
                    rL.append('1')
                if carry == '1':
                    rL.append('0')
            elif t == '00':
                rL.append(carry)
                carry = '0'
            else:
                assert t == '11'
                if carry == '0':
                    rL.append('0')
                    carry = '1'
                else:
                    rL.append('1')
                    carry = '1'
        if carry == '1':
            raise ValueError, 'overflow'
        rL.reverse()
        return word(''.join(rL))
    
    def __mul__(self, a):
        L = list()
        for i in range(len(self.b)-1,-1,-1):
            x = self.b[i]
            if not x == '1':
                continue
            n = word.SZ - i - 1
            r = word(a.b[n:] + '0'*(n))
            L.append(r)
        res = L.pop(0)
        #print 'start:       ', res
        while L:
            next = L.pop(0)
            #print 'next:        ', next
            res = res + next
            #if L:
                #print 'intermediate:', 
            #else:
                #print 'result:      ',
            #print res
        return res

    def __pow__(self, n):
        res = self
        for i in range(n-1):
            res = res * self
        return res

test.py

import random
from word import word
        
R = range(1000)
N = 10
for i in range(N):
    x = random.choice(R)
    y = random.choice(R)    
    xw = word(x)
    print 'x    ', x
    yw = word(y)
    print 'y    ', y
    
    print 'x + y',
    r = xw + yw
    print r
    S = x+y
    assert S == int(str(r).split()[-1])
    print 'check', x, '+', y, '=', S
    
    print 'x * y',
    r = xw * yw
    print r
    P = x*y
    assert P == int(str(r).split()[-1])
    print 'check', x, '*', y, '=', P
    
    for i in range(2,N*2):
        try:
            r = xw**i
        except ValueError:
            i = i-1
            break
    
    z = i
    print 'z     ', z
    print 'x**z ', r
    E = x**z
    try:
        assert E == int(str(r).split()[-1])
    except AssertionError:
        print 'ran into a slight problem\n'
        continue
    print 'check', x,'**', z, '=', E
    print

Dissecting RSA keys in Python (4)

Still working with RSA keys. I accomplished the most important of the items left undone so far, to use the keys to encrypt a message using my own code.

In the process, I learned a new simple Python fact: the pow function (which is now a built-in), can take a modulus as a third argument.

And, it works where this doesn't seem to: (x**y) % z

(I discovered this by digging into the rsa module source).

python -m timeit -s "import rsa;  \
f = open('id_rsa');  data = f.read(); f.close(); k = rsa.PrivateKey.load_pkcs1(data)"\
 "41330915578951772302369**k.e % k.n"

10000 loops, best of 3: 79.4 usec per loop

python -m timeit -s "import rsa;  \
f = open('id_rsa');  data = f.read(); f.close(); k = rsa.PrivateKey.load_pkcs1(data)"\
 "pow(41330915578951772302369, k.e, k.n)"

10000 loops, best of 3: 70.2 usec per loop

In the example here, both work and have about the same timing. But in the code below, the first version hangs when doing decryption (with a large base). Here's the output, followed by the script.

> python encode.py 
m:   Hello, secret world!
p:   .xyz.Hello, secret world!.xyz.
a:   413309155789517723023698766343791993289928631329
c:   11434905702482726455415220687715293368262190253795 ..
i:   413309155789517723023698766343791993289928631329
r:   Hello, secret world!

encode.py

import rsa

with open('id_rsa') as f:
    data = f.read()
k = rsa.PrivateKey.load_pkcs1(data)
n = k.n
e = k.e
d = k.d

def my_atoi(s):
    L = [ord(c) for c in s]
    k = 256
    iL = L[:]
    iL.reverse()
    x = iL[0]
    for i in iL[1:]:
        x += i*k
        k *= 256
    return x

def my_itoa(i):
    rL = list()
    while i:
        rL.append(i%256)
        i = i/256
    rL.reverse()
    return ''.join([chr(n) for n in rL])

def encrypt(m):
    return m**e % n
    
def decrypt(c):
    # note:  c**d % n fails
    return pow(c,d,n)
    
if __name__ == '__main__':
    m = 'Hello, secret world!'
    pad = '.xyz.'
    p = pad + m + pad
    a = my_atoi(m)
    c = encrypt(a)
    i = decrypt(c)
    r = my_itoa(i)
    r = r.replace(pad,'')

    L = zip('mpacir',[m,p,a,c,i,r])
    N = 50
    for varname, var in L:
        s = str(var)
        print varname + ':  ', s[:N],
        if len(s) > N:  print '..'
        else:  print

Dissecting RSA keys in Python (3)

I'm working on understanding the structure of RSA keys as output by the ssh-keygen utility (previous posts here and here). At the risk of becoming redundant, I want to try to simplify what we did previously. Let's begin by rehashing old material on bytes in Python:

It's easy to get confused between a byte and its string representation, or for that matter, between an unsigned int and its string representation. Although hexadecimal is natural too, it's perhaps easiest to think of an individual byte in terms of the decimal equivalent (0..255). That's because a Python int converts easily to chr, bin, and hex.

>>> i = 35
>>> c = chr(i)
>>> c
'#'
>>> i == ord(c)
True
>>> b = bin(i)
>>> b
'0b100011'
>>> h = hex(i)
>>> h
'0x23'

Notice that hex and bin both give unpadded output:

>>> hex(1)
'0x1'
>>> bin(1)
'0b1'

Don't be fooled. h and b here are 'str' datatypes. But these string representations of bin and hex values go back to int easily (though we must specify the old base explicitly):

>>> int(b,2)
35
>>> int(h,16)
35

To interconvert hex and bin, it's easiest to go through int:

>>> bin(35)
'0b100011'
>>> hex(35)
'0x23'
>>> hex(int(bin(35),2))
'0x23'
>>> bin(int(hex(35),16))
'0b100011'

The struct module

This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.

unpack works on the particular string representation of a byte in which it is a single character:

>>> i = 35
>>> c = chr(i)
>>> c
'#'
>>> unpack('B',c)
(35,)
>>> b = bin(i)
>>> unpack('B',b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 1
>>>

To construct a multi-byte value like a 4-byte unsigned int we can do this:

>>> h = '\x23'
>>> type(h)
<type 'str'>
>>> b_little = '\x23\x00\x00\x00'
>>> type(b_little)
<type 'str'>

With multi-byte objects (like ints) we get into endian-ness---'I' is the format for an unsigned int, and '>' means big-endian:

>>> unpack('I', b_little)
(35,)
>>> b_big = '\x00\x00\x00\x23'
>>> unpack('I', b_big)
(587202560,)
>>> unpack('>I', b_big)
(35,)

One can also use the binascii module, but this is more complicated. Let's write two bytes to a file and then read the binary data:

>>> fn = 'temp'
>>> FH = open(fn,'w')
>>> FH.write('##')
>>> FH.close()
>>> FH = open(fn,'rb')
>>> data = FH.read()
>>> FH.close()

This "data" is still a string:

>>> type(data)
<type 'str'>
>>> data
'##'
>>> len(data)
2
>>> b2a_hex(data[0])
'23'
>>> b2a_hex(data)
'2323'
>>> b2a_hex('#Az')
'23417a'

Or you can get fancy, but it seems unnecessary..

>>> import array
>>> L = array.array('B','#Az')
>>> type(L)
<type 'array.array'>
>>> L[0]
35
>>> L[1]
65
>>> L
array('B', [35, 65, 122])
>>> b2a_hex(L)
'23417a'

With all this in mind, we can redo the script from yesterday to read the base64-encoded key data in a way that is simply understandable. The output first:

> python read.py 
dlen 7
dlen 1
dlen 129

209,8,39,27 .. 169,97,145,127
0xd1,0x8,0x27,0x1b .. 0xa9,0x61,0x91,0x7f 

1467871546 .. 1722964351

We decode the data and break it into 3 parts exactly as before. Looking at the third part, we unpack it byte by byte. The result of unpacking is a list of ints. We convert that list directly into the large number n (by doing b * 256**i for each byte, moving along the list in reverse order), or we can convert the ints into hex values, assemble that list into one long hex string, and then call eval as before. It seems perfectly transparent now.

It even seems clear (in retrospect) why there is an additional null byte after the int that specifies the size of the third segment. It is probably to align the base64 encoding to begin with the first data byte.

The only remaining difficulty is that this approach fails for the private key. So that's still a mystery.

read.py

import base64
from struct import unpack

FH = open('data.txt','r')
data = FH.read()
FH.close()
b64_data = ''.join(data.strip().split())
data = base64.b64decode(b64_data)

L = list()
while data:
    dlen = unpack('>I',data[:4])[0]
    L.append(data[4:dlen+4])
    data = data[dlen+4:]
    print 'dlen', dlen
print

# let's look at n
bL = L[2]
bL = bL[1:]    # extra null first, why?
iL = [unpack('B',b)[0] for b in bL]
hL = [hex(i) for i in iL]

# look at the int and hex values
pL = [str(n) for n in iL]
print ','.join(pL[:4]), '..', ','.join(pL[-4:])
print ','.join(hL[:4]), '..',
print ','.join(hL[-4:]), '\n'

# first just do the computation ourselves
m = 256
iL.reverse()
n = iL[0]
for i in iL[1:]:
    n += i*m
    m *= 256
    
s = str(n)
print s[:10], '..', s[-10:]
    
# hex doesn't pad the output
# remove the '0x' and add the extra '0' if needed
for i in range(len(hL)):
    b = hL[i]
    if len(b) < 4:
        hL[i] = '0' + b[-1]
    else:
        hL[i] = b[-2:]  
h = '0x' + ''.join(hL)
nh = eval(h)
assert n == nh

Tuesday, August 30, 2011

Dissecting RSA keys in Python (2)

The public key from last time was given (in id_rsa.pub) as:

AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosx
pbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvF
V4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k
/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5j
voJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8=

(It has been broken into 40 character lines). If we look at the first 32 characters of this base64 encoding:

AAAAB3NzaC1yc2EA
AAABIwAAAIEA0Qgn

the decoded version of this in hex is:

0x0  0x0  0x0 0x7  0x73 0x73 0x68 0x2d 0x72 0x73 0x61 0x0
0x0  0x0  0x1 0x23 0x0  0x0  0x0  0x81 0x0  0xd1 0x8  0x27

The first four bytes are

0x0 0x0 0x0 0x7

interpreted as a 32-bit unsigned int that is equal to 7, and indicates that the next 7 bytes are grouped together to form this element of the data (which is 'rsa-ssh', as I mentioned before).

0x73 0x73 0x68 0x2d 0x72 0x73 0x61
   s    s    h    -    r    s    a

The next 4 bytes after that are:

0x0 0x0 0x0 0x1

indicating the following block is a single byte of data: 0x23. In ASCII-encoding this would be the character '#', but it's actually an unsigned int:

>>> int('0x00000023', 16)
35

This is the value of the exponent for the key pair. More on this in a moment. Finally we get to

0x0 0x0 0x0 0x81

>>> int('0x00000081', 16)
129

It turns out there are exactly 129 bytes left in the data. In the code from the first post (from here), we used the struct module to unpack these three segments of data (we labeled them "parts").

According to the docs, the struct module can be used to interpret strings as packed binary data.

The first argument to unpack is a format string. The first character of the format string can be used to indicate the byte order, size and alignment of the packed data

The '>' indicates big-endian data, data laid out in registers in the same order as it is in memory. While Intel x86 (and current Macs) are little-endian, network order is big-endian and (I suppose) the SSH standard respects this. The next format character 'I' indicates the data following are unsigned ints.

import struct

hL = ['0x0','0x0','0x0','0x81']
iL = [int(h,16) for h in hL]
cdata = ''.join([chr(n) for n in iL])
res = struct.unpack('>I', cdata)

>>> print res
(129,)

It returns a tuple and we want the first value.

In the code from yesterday, we simply copied the correct number of bytes in each segment (as "characters"---not ASCII).

In the last part of this example, we need to transform those bytes into numerical data. We'll use struct.unpack again, except this time specifying binary data. For the exponent we had a single byte '\x23'', corresponding to character '#' which is 35 as an unsigned int:

>>> struct.unpack('B','#')
(35,)

Our function f passed the first element of this tuple back to a big list comprehension.

e = eval('0x' + ''.join(['%02X' % f(x) for x in parts[1]]))

In the case of the exponent (e) it operated on only a single value, so I'll rewrite it to clarify what this does. Remembering that f(x) will return 35, we'll get rid of the list stuff and just do:

>>> s = '%02X' % 35
>>> s
'23'
>>> e = eval('0x' + s)
>>> 
>>> e
35

(I'm not too swift with format strings, but you can read about them here).

And it doesn't do much in this case! But suppose we had two of these bytes in a row. Then:

>>> s = ('%02X' % 35) * 2
>>> s
'2323'
>>> eval('0x' + s)
8995
>>> 35*256 + 35
8995

eval will turn our string of bytes into a multi-byte unsigned int. This is news to me. So, if we actually had the string of bytes in hex divided into two sets of bytes, those up to and including '0x81'

['0x0', '0x0', '0x0', '0x7', '0x73', '0x73', '0x68', '0x2d', '0x72', '0x73', '0x61', '0x0', '0x0', '0x0', '0x1', '0x23', '0x0', '0x0', '0x0', '0x81']

and those after:

L = ['0x0', '0xd1', '0x8', '0x27', '0x1b', '0xe2', 
      '0xc1', '0x42', '0x8b', '0x31', '0xa5', '0xb9', '0x9', '0x21', '0x4d', '0x1e', '0x59', 
      '0x5f', '0x59', '0x7f', '0xb2', '0xfa', '0xdd', '0xfc', '0xb7', '0x5', '0x7a', '0x7a', 
      '0x4f', '0xd4', '0x50', '0x1f', '0xce', '0x32', '0x21', '0x2b', '0x3c', '0x48', '0xcb', 
      '0xc5', '0x57', '0x82', '0x75', '0x35', '0x82', '0x50', '0xb4', '0xda', '0x46', '0xa2', 
      '0x93', '0x45', '0x96', '0xfc', '0xec', '0xb8', '0x59', '0xe7', '0xc5', '0xa4', '0x5a', 
      '0x33', '0xcb', '0xab', '0x78', '0x52', '0x10', '0x55', '0xaf', '0x24', '0xfd', '0xd4', 
      '0x73', '0xbf', '0x7a', '0xfb', '0x43', '0x22', '0x2e', '0xa0', '0x6c', '0x8d', '0x40', 
      '0xb0', '0xe4', '0xdc', '0xaf', '0x13', '0xb1', '0xef', '0xf9', '0xe5', '0xf8', '0xee', 
      '0x23', '0x12', '0x45', '0x11', '0xfe', '0x63', '0xbe', '0x82', '0x6c', '0xdd', '0x9f', 
      '0xe3', '0x61', '0xd3', '0xc7', '0xad', '0xe8', '0x3c', '0xc3', '0x33', '0x42', '0x7', 
      '0xfb', '0x8c', '0x91', '0xb0', '0x9c', '0x71', '0x8b', '0x15', '0xb7', '0xa9', '0x61', 
      '0x91', '0x7f', '0x0']

Note: The extra byte at the end here is caused by our having decoded by hand.

Paste it into the interpreter. Then, they have to be padded out...

for i in range(len(L)):
    s = L[i]
    c = s[-1]
    if len(s) < 4:  
         L[i] = '0' + c
    else:  
        L[i] = s[-2:]

Remove the first and last bytes and turn it into one long hex value:

>>> s = '0x' + ''.join(L[1:-1])
>>> eval(s)
146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351L

That matches what we got yesterday, and what the rsa module gave us for n from the private key.

This use of eval seems to be widely known, but I would never have appreciated it from just reading the docs.

Here is a quick script that shows all the values for each set of four characters in the base64 encoding, first 9 bytes of output first:

> python script2.py 
['A', 'A', 'A', 'A']
[0, 0, 0, 0]
['0b0', '0b0', '0b0', '0b0']
['000000', '000000', '000000', '000000']
['00000000', '00000000', '00000000']
0x0 0x0 0x0
0 0 0 

['B', '3', 'N', 'z']
[1, 55, 13, 51]
['0b1', '0b110111', '0b1101', '0b110011']
['000001', '110111', '001101', '110011']
['00000111', '01110011', '01110011']
0x7 0x73 0x73
7 115 115 

['a', 'C', '1', 'y']
[26, 2, 53, 50]
['0b11010', '0b10', '0b110101', '0b110010']
['011010', '000010', '110101', '110010']
['01101000', '00101101', '01110010']
0x68 0x2d 0x72
104 45 114

from string import *
v = True

cL = list(uppercase + lowercase + digits + '+/')
D = dict(zip(cL,range(len(cL))))
D['='] = 0

key = '''AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosxpbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvFV4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5jvoJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8='''

L = list(key)

while L:
    s = L[:4]
    L = L[4:]
    if v:  print s
    dL = [D[c] for c in s]
    if v:  print dL
    dL = [bin(c) for c in dL]
    if v:  print dL
    bL = list()
    for b in dL:
        n = 8 - len(b)
        b = n*'0' + b[2:]
        bL.append(b)
    if v:  print bL
    b = ''.join(bL)
    bL = [b[:8], b[8:16], b[16:]]
    if v:  print bL
    iL = [int('0b' + b,2) for b in bL]
    for i in iL:
        print hex(i),
    print
    for i in iL:
        print i,
    print '\n'

Monday, August 29, 2011

Dissecting RSA keys in Python

I want to look at RSA keys a little bit more in this post. Let's generate a public/private pair (1024 bits) using the SSH utility keygen (no passphrase):

> ssh-keygen -b 1024 -t rsa 
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/telliott_admin/.ssh/id_rsa): id_rsa
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
96:49:5d:40:b1:90:de:99:c5:05:43:f2:92:45:69:67 telliott_admin@c-98-236-78-154.hsd1.wv.comcast.net
The key's randomart image is:
+--[ RSA 1024]----+
|        .o=*B+.  |
|        .o B=.E  |
|       ...==.o   |
|       ..o+.     |
|        S        |
|       .         |
|                 |
|                 |
|                 |
+-----------------+

[ To be honest, at this point, I'm not actually sure this is the randomart that goes with the example in this post.. but I think it could be :) ]

Start with the "public" key. The data in the file looks like this:

id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosxpbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvFV4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5jvoJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8= my_identifier

Where my_identifier is

telliott_admin@c-myipaddress_with_dashes.hsd1.wv.comcast.net

The common representation of RSA keys (like we see in the file above) uses base64 format. I was confused about this at first, but found a great explanation here, and of course in wikipedia:

The base64, base32, and base16 encodings convert 8 bit bytes to values with 6, 5, or 4 bits of useful data per byte, allowing non-ASCII bytes to be encoded as ASCII characters for transmission over protocols that require plain ASCII, such as SMTP. The base values correspond to the length of the alphabet used in each encoding. There are also URL-safe variations of the original encodings that use slightly different results.

Consider the byte 1001-1110

We can encode this byte using hexadecimal representation as '9e'

>>> b = 0b10011110
>>> b
158
>>> hex(b)
'0x9e'

I didn't actually know you could do this, i.e. using 0b with no quotes (see SO).

The encoding of those 8 bits (one byte) as ASCII characters takes 2 bytes. The '9' and the 'e' take up that much space. An advantage of this representation of the byte is that it's easy to read, also the information can be transmitted to devices that are expecting an ASCII string, where the highest-value bit should be unset.

A disadvantage is that it doubles the size of the string that we're transmitting. As the quote indicates, the other encodings cram more bits of information into the byte to be transmitted. In base32 we could use (RFC4648) the characters 'A'..'Z' plus '2'..'7'.

In base64 we use uppercase, lowercase, digits plus '+' and '/', except that:

Using standard Base64 in URL requires encoding of '+', '/' and '=' characters into special percent-encoded hexadecimal sequences ('+' = '%2B', '/' = '%2F' and '=' = '%3D'), which makes the string unnecessarily longer.
For this reason, a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_',

The wikipedia article has a great graphic showing an encoding in detail. We do conversions easily by using the Python base64 module.

>>> import base64
>>> e = base64.b64encode
>>> d = base64.b64decode
>>> e('Man')
'TWFu'
>>> d('TWFu')
'Man'
>>> s = 'Fourscore, and seven years ago'
>>> len(s)
30
>>> c = e(s)
>>> c
'Rm91cnNjb3JlLCBhbmQgc2V2ZW4geWVhcnMgYWdv'
>>> d(c)
'Fourscore, and seven years ago'

The encoding process looks at each sequence of 24 bits in the input (3 bytes) and encodes those same 24 bits spread over 4 bytes in the output. The last two characters [in the next example], the ==, are padding because the number of bits in the original string was not evenly divisible by 24 in this example.

>>> e('Ma')
'TWE='

We break this two byte (16 bit) value up into 6 + 6 + 6 = 18 bits---that makes three base64 values, and the third one is now E rather than F because the bit pattern is 000100, where the last two 00 are padding. Additional adding is added to bring us back to the byte boundary.

In Lincoln's famous phrase, there are 30 bytes, which is evenly divisible by 6, hence no padding is needed.

The second part of the problem was solved by a question I found on SO

Here is my version of the code:

import base64
import struct

data = open('id_rsa.pub').read().split(None)[1]
print data[:24]
keydata = base64.b64decode(data)
print keydata[:18]

parts = []
while keydata:
    dlen = struct.unpack('>I',keydata[:4])[0]
    data, keydata = keydata[4:dlen+4], keydata[4+dlen:]
    parts.append(data)

def f(x):
    return struct.unpack('B', x)[0]
    
e = eval('0x' + ''.join(['%02X' % f(x) for x in parts[1]]))
print 'e', e

n = eval('0x' + ''.join(['%02X' % f(x) for x in parts[2]]))
print 'n', n

I have to confess I haven't sorted this out completely yet. Nevertheless, it allows us to convert the RSA public key from ssh-keygen to an integer:

Output:

> python script.py
AAAAB3NzaC1yc2EAAAABIwAA
ssh-rsa#
e 35
n 146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351

I want to finish by showing that these values are actually correct, and tie the whole thing back to an earlier discussion of the mathematics of public key cryptography.

I found a module for Python that handles RSA cryptography.

One very nice thing about it is that it's pure Python. It might not be fast, but there's nothing to do to build it. (I used easy_install). It turns out that the rsa module can read the private RSA key that we got earlier (top of the post), but not the public key (which is why I learned everything that's come before this point):

> python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rsa
>>> with open('id_rsa') as f:  data = f.read()
... 
>>> data.split()[4][:50]
'MIICWgIBAAKBgQDRCCcb4sFCizGluQkhTR5ZX1l/svrd/LcFen'
>>>
>>> k = rsa.PrivateKey.load_pkcs1(data)
>>> k.n
146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351L
>>> k.e
35
>>> k.d
71296617968088267948638260991671705244575086516882194071102095926754516318074629294696149759283397758157177175856544613145845391211758861147798865874304045384903321200809536161346504709634159813307741235395231601606901597492591281205810328850837168587984349087962868939575631849929827946396907884195208774355L
>>> k.p
12214944519618901215070692893221534580187337690484280451921546594832793966376942655508370083135703324858458340368238021755353370324360760471666368429071147L
>>> k.q
12017013618393528112233953119917677248859562342149056793052943111068006176296374093622600767249370271694572460567324457033873003984957249202580919628769533L
>>> k.n == k.p * k.q
True
>>>

You can see that n equals p * q and that it matches what we eventually obtained from the public key file (id_rsa.pub) above. That's pretty cool.

Still to do: use the keys to do encryption and decryption and check that the math is right. Figure out how the hash/digest is done. And figure out how the passphrase is used to modify the private key file. We can even add a passphrase at this point to the old version:

ssh-keygen -f id_rsa -p

I used 'abcde' as a passphrase (weak!):

> cat id_rsa
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,FB4D3BD371510C4B37552EF6E949CC48

E+KtbKVAwE4Dxj3vKYgfHpbgqF9Nk6O4vJ+C6wogXEHXr8vkM9SF5pz6lKNfGRbC
S93/lurvq+1+rCuldDNmEjdy36cbgrWJ1CIwB6fq5VyT3+v+AxDMqV9nQkvHjnxR
/ON2n+CjOSklnvmaBk3l8+IvCIKar2hGS/1qHqJRKJ3B1fgd5I8i9yH6hx8yl8Qm
AfVPgBJ4+91dCW2Y8odtXENgjstGtWZkYd0O+EqPuSuPHLu+J7M2KSQNe8aKOhGf
DXCEwR4aGW8EXfRNQpULbi5JruYebYlg3JILm2MaFwy+OW39q1VElvBc5wD01awW
BF7Y9x++bi9krX/G49NzkLdSVshkDMSP+eRZ4sncWvDCrkN2xlWqnNfQ06I+XCVW
FMUmyMFI87eNmjcZOgIi1S+fQtWcbl4o9Hx1z6xcMd7WxOnfIxe5zHIu8C24WONW
spcVxSsHAA57jYa8j292prtK6HDDa6UOAa32hBz1lvC8fMWPxwluTJ8eyeeK/7O6
hNI5G9gM4MsMEMau8O49fEm0RuiT/aB8IXZeEVQNbV/FdixyBBhY53DMnFQ5wau8
OJCpL51S5lj2o8eAJc61SHwX/Fc9uF+rmAFy08tthLuCF2wGdZWsE7IJ8NQzFJA+
Us2dTCjY57V/QoCC41rWh+YP+srcEK1Wp7DfvXcq98N1/+tECyCe/aDdPCIM/8QD
glKrx8Ah7Tsx6of3vCHiYG0q8Yu68NNJOiWrnGMEmP0Gq9yhLYqKrc8a3oIP0Lst
Ss+0p989df73GdM3nyqtLUwAzaCJQnDivLFT6gltEnM=
-----END RSA PRIVATE KEY-----

Sunday, August 28, 2011

Duly quoted

Recycled from an earlier post:

Programs must be written for people to read, and only incidentally for machines to execute.

- Abelson & Sussman, SICP

How do we convince people that in programming simplicity and clarity—in short: what mathematicians call "elegance"— are not a dispensable luxury, but a crucial matter that decides between success and failure?

- E. W. Dijkstra

Linux Server SSH

To continue with my Ubuntu Linux server odyssey, I got SSH working this afternoon for access from my LAN. As always, heed the warnings, these posts are without a warranty of any kind, not least because I am a total amateur. They exist because it's easy to have Google to "index my brain" or rather, what was in my brain at a certain time.

The objective is to enable remote login for my Ubuntu machine running with a local ip address of 10.0.1.2. We begin by setting a stronger password for my account on Ubuntu under: System > Preferences > About Me (weirdly, it's not under Passwords and Encryption Keys ..).

On the client (OS X Snow Leopard Server), I already have some key pairs from a previous test

> ls ~/.ssh
authorized_keys	id_dsa		id_dsa.pub	known_hosts

but they are DSA keys, which the notes I'm going to follow deprecate. We'll make a short 1024 ~~byte~~ bit RSA pair for now:

ssh-keygen -b 1024 -t rsa

> cat id_rsa.pub 
ssh-rsa AAAAB..5Eoec= telliott_@c-98-___-__-154.hsd1.wv.comcast.net

Not sure what the pseudo ip address or "comcast" thing is about.

I also take this opportunity to set a passphrase for the public key (and of course I make a note of it). This will be useful (it says) because we can set up the server to require both a password to gain access to the public key, as well as the corresponding private key before login is allowed.

[UPDATE: I'm still working my way through this, but I suspect that the previous statement isn't correct. The passphrase is used to protect the value of the private key. So, in this setup OS X will decrypt the private key using the passphrase, and the stored value of the private key in id_rsa is encrypted. I'll try to figure all this out soon. ]

The easiest way to get my new public key onto the server is to use SSH (with password, before we disable it). The next question is, do we need to install an SSH server? It appears yes.

sudo apt-get install openssh-server

There is a pre-existing config file /etc/ssh/ssh_config but after the previous command there are more including sshd_config. I make sure to save a copy of this file before I modify it. Now try:

ssh te@10.0.1.2

ssh: connect to host 10.0.1.2 port 22: Connection refused

OK, so we need to modify /etc/ssh/sshd_config. Port22 is already uncommented. Now uncomment:

PermitRootLogin no
ChallengeResponseAuthentication yes
PasswordAuthentication yes   # we'll set it to no eventually

On the server:

sudo /etc/init.d/ssh restart

On the client:

> ssh te@10.0.1.2
The authenticity of host '10.0.1.2 (10.0.1.2)' can't be established.
RSA key fingerprint is d1: .. :83.
Are you sure you want to continue connecting (yes/no)?

Check the fingerprint on the the server:

ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub

2048 d1: .. :83 /etc/ssh/ssh_host_rsa_key.pub (RSA)

Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.2' (RSA) to the list of known hosts.
Connection closed by 10.0.1.2

> ssh te@10.0.1.2
te@10.0.1.2's password: 
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-11-generic x86_64)

Now to change to not using our password. We need to copy our RSA public key to the server in some secure way. According to the sshd_config file, the "authorized keys file" is
%h/.ssh/authorized_keys. That's in my home directory. I do:

> scp ~/.ssh/id_rsa.pub te@10.0.1.2:~/.ssh/authorized_keys
te@10.0.1.2's password: 
id_rsa.pub                                    100%  260     0.3KB/s   00:00    

te@VB:~$ logout
Connection to 10.0.1.2 closed.

> ssh te@10.0.1.2
Identity added: /Users/telliott_admin/.ssh/id_rsa (/Users/telliott_admin/.ssh/id_rsa)
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-11-generic x86_64)

* Documentation:  https://help.ubuntu.com/

Last login: Sun Aug 28 14:35:49 2011 from osxserver.local
te@VB:~$

Now, finally, be sure to turn off password authentication for the ssh server: PasswordAuthentication no. At first, I'm not being prompted for a password to retrieve my public key on the server.. just getting automatically logged in by:

ssh te@10.0.1.2

I forgot to do:

ChallengeResponsePasswords yes

sudo /etc/init.d/ssh restart

Restart and I get the challenge.. Although it is apparently cached sometimes. Looks like it works.

Linux Server

A couple of weeks ago I described installation of Ubuntu Linux (11.04), running under Virtual Box on Mac OS X Lion. I got matplotlib (and scipy) installed and running without any difficulty using apt-get alone.

Next, I installed Apache Server 2.2 and configured everything to run Python scripts by using Firefox from Ubuntu on the VM like this:

localhost/cgi-bin/test.py

After struggling a bit, I figured out how to use Virtual Box in NAT mode to do "port forwarding"---by modifying

/etc/apache2/sites-available/default

so then I could launch the script from the host OS by doing

localhost:8080/cgi-bin/test.py

from Safari or by using curl in Terminal.

The next step was to try to access the server from a second machine on my home network. In order to do that, (as I understand it) I can either remain in NAT mode and try to connect via the host machine and port-forwarding, or use "bridging" mode to effectively put the Ubuntu server on the network.

In order to accomplish this, I need to know the ip address of the server as assigned by my Apple Time Capsule router. Now the problem is that this ip address may change with time (since we're getting it from the Time Capsule as a DHCP server). According to the folks I asked on serverfault, I need to switch the Ubuntu network IPv4 settings to manual and ask for a specific ip address and this is correct so far as it goes.

However, the address I ask for in Ubuntu

has to match the range of addresses that the router is set up to hand out. Furthermore, we probably want to have most ip addresses handled dynamically, and only use a static ip for the one machine. If we wanted them all static, we'd go here:

I found the answer here but the setting is actually visible in this screenshot:

Under Internet > DHCP there is a TableView called DHCP reservations. I just reserved the first available ip address. After re-booting Ubuntu and restarting the server, I can run the test script or do:

Now I have to figure out how to access the server from the outside world. I should probably remove the test script from cgi-bin (since it reveals information about paths on the server) and read up on firewall configuration, etc. before I do that. The first problem for this part (not yet solved) is that my ip address from my isp, obtained with

checkip.dyndns.org/

does not seem to be accessible. I get a "Safari can't connect to the server .." message.

Saturday, August 27, 2011

Misunderstanding Euclid

Category: things you thought you knew that are wrong
Today's entry is from mathoverflow:

Many students believe that 1 plus the product of the first n primes is always a prime number. They have misunderstood the contradiction in Euclid's proof that there are infinitely many primes. (By the way, 2 * 3 * 5 * 7 * 11 * 13 + 1 is not prime.)

Let's ask Python:

 
from math import sqrt
def f(N):
    N += 1
    print 'N', N
    n = int(sqrt(N))
    for i in range(2, n+1):
        if not N % i:
            print i, N/i

>>> N = 2*3*5*7*11*13
>>> f(N)
N 30031
59 509
>>> N *= 17
>>> f(N)
N 510511
19 26869
97 5263
277 1843
>>> N *= 19
>>> f(N)
N 9699691
347 27953

What the theorem actually says (wikipedia):

Take any finite list of prime numbers p₁, p₂, ..., p_n. It will be shown that at least one additional prime number not in this list exists. Let P be the product of all the prime numbers in the list: P = p₁p₂...p_n. Let q = P + 1.

Then, q is either prime or not:

If q is prime then there is at least one more prime than is listed.

If q is not prime then some prime factor p divides q. This factor p is not on our list: if it were, then it would divide P (since P is the product of every number on the list); but as we know, p divides P + 1 = q. Then p would have to divide the difference of the two numbers, which is (P + 1) − P or just 1. But no prime number divides 1 so there would be a contradiction, and therefore p cannot be on the list. This means at least one more prime number exists beyond those in the list.

Late nights on the web

This is just a short post to document some interesting links. I recently posted about getting matplotlib installed with minimum fuss on OS X Lion, and this seems to have helped quite a few folks.

The person who showed me how to do this is named Rui Carmo, and I finally found the time to begin browsing through his extensive site; he has lots of good stuff about Macs and other technology (like here).

Perhaps the most interesting thing I found is what appears to be a very high quality company called Linode, offering Linux Virtual Servers in the cloud, their docs include an extensive library related to the topic. I'm thinking seriously about trying it (it only costs just a bit more than the NY Times).

And the final result (of browsing the web on a sleepless night) is to note the profusion of different sites connected to Stack Overflow, including serverfault and superuser. Very rewarding to browse through Q & A, and of course, when I have a question of my own.

One more thing: a pdf form-filler link that is not Adobe. One of the sites I use for work has Adobe forms on it that I can only read through Safari if I have Reader installed as the default pdf viewer. Maybe it will help with this, since I loathe Adobe.

Wednesday, August 24, 2011

Pretty code (6)

This is absolutely the last post about html and code stuff. The whole adventure started here. It's a bit amusing to realize that if I had just understood what <br> tags were about (and <span> and <div>), this would never have happened.

I decided to write my own parser for Python code, so that I could post the results here. On the way, I learned something about "styles" and stylesheets. To organize things, I put this code for a "style" in the template for the blog at the end of <head>:

<style type='text/css'>
div.content
{
font-size:12pt;
font-family:arial;
color: #404040;
}

div.code
{ 
font-size:12pt;
font-family:courier;
margin: 20px;
border-style: solid;
padding: 5px;
border-width: 1px;
border-color:#ff00ff;
}

span.kw   { color:blue; }
span.str  { color:red; }
span.str3 { color:purple; }
span.py   { color:#00b0b0; }
span.cm   { color:green; }
</style>

The parser output looks great when run on itself. The first little bit is here:

But, alas, it was not to be. The Blogger editor is now (still) messing with my code, removing all the <br /> that I carefully put in, so that the informative whitespace is gone in the actual post (though not in the intial preview before posting). It looks hopeless at this point.

Files for the parser on Dropbox (here). The only known bugs are those visible in the comment lines above, where the strings should have been masked by being inside a comment.

Unfortunately, I do not know how to make this all work for Blogger.

What a waste! I am not happy with these guys. The only way I can see is to have my own server.

Monday, August 22, 2011

Pretty code (5)

This is the first section of the same example from the last post (here, more explanation here), run through Pygments. It is self-contained and doesn't require loading any other resources.

The code to generate it was:

> pygmentize -f html -O full -o test.html pyparser.py

This is the same parser that is used for the official Python docs, which are generated in combination with Sphinx. I have only a hazy idea of how that's done.

It's easily customizable if you want.

import sys
from keyword import kwlist
from string import punctuation as punct
from utils import load_data
import html_tags as H
pywords = ['True','False','list','dict',
           'int','float','append','extend',
           'sys','argv','pop','open','write',
           'close','string','time']

try:  
    fn = sys.argv[1]
except IndexError:  
    fn = 'example.py'
data = list(load_data(fn))

Pretty code (4)

Explanation of what this is about from last time here. I got triple-quoted strings working, and I'm really pleased with the results. More another time on other options for code-highlighting.


import sys
from keyword import kwlist
from string import punctuation as punct
from utils import load_data
import html_tags as H
pywords = ['True','False','list','dict',
           'int','float','append','extend',
           'sys','argv','pop','open','write',
           'close','string','time']

try:  
    fn = sys.argv[1]
except IndexError:  
    fn = 'example.py'
data = list(load_data(fn))

D = {'is_cm':False,
     'in_str_1':False,
     'in_str_2':False,
     'in_str_3':False }

L = list()

while data:
    c = data.pop(0)
    # comments first
    if c == '#':
        if not (D['in_str_1'] or D['in_str_2']):
            L.extend(list(H.cm_start))
            D['is_cm'] = True
    if c == "\n" and D['is_cm']:
        L.extend(list(H.cm_stop))
        D['is_cm'] = False
    L.append(c)
    
    a = '''triple-quoted-string
    with a continuation and two keywords'''
    
    # triple-quoted strings
    if c == "'" and len(data) > 1:
        if data[0] == "'" and data[1] == "'":
            if (not D['in_str_1'] and not D['in_str_2']):
                if not D['in_str_3']:
                    L.pop()
                    L.extend(list(H.str3_start))
                    L.extend(["'"] * 3)
                    data.pop(0)
                    data.pop(0)
                    D['in_str_3'] = True
                else:
                    data.pop(0)
                    data.pop(0)
                    L.extend(["'"] * 2)
                    L.extend(list(H.str3_stop))
                    D['in_str_3'] = False
                    continue
    
    # single-quoted strings
    if c == "'":
        if D['in_str_3']:
            continue        
        if (not D['in_str_1']):
            if not D['in_str_2']:
                # start a str_1
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['in_str_1'] = True
            else:
                # already in str_2 or str_3
                pass
        else:
            if D['in_str_1']:
                # terminate str_1    
                L.extend(list(H.str_stop))
                D['in_str_1'] = False
            
    # double-quoted strings
    if c == '"':
        if not D['in_str_2'] and not D['in_str_3']:
            if not D['in_str_1']:
                # start a str_2
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['in_str_2'] = True
            else:
                # already in str_1 or str_3
                pass
        else:
            if D['in_str_2']:
                # terminate str_2    
                L.extend(list(H.str_stop))
                D['in_str_2'] = False
s = ''.join(L)

# keywords last
pL = list()
D['is_str_3'] = False
for line in s.split('\n'):
    D['is_cm'] = False
    words = line.split()
    for w in words:
        # no kw highlighting in comments
        if w.startswith(H.cm_start):
            D['is_cm'] = True
        if w.startswith(H.str3_start):
            D['is_str_3'] = True
        if H.str3_stop in w:
            D['is_str_3'] = False
        if not D['is_cm'] and not D['is_str_3']:
            if w in kwlist:
                r = H.kw_start + w + H.kw_stop
                line = line.replace(w, r)
            for p in pywords:
                if p in w:
                    L = w.split(p)
                    if L[0] and not L[0][-1] in punct:
                        continue
                    if L[1] and not L[1][0] in punct:
                        continue
                    r = H.py_start + p + H.py_stop
                    line = line.replace(p, r)
    pL.append(line)
    
s = H.br.join(pL)
pL = [H.head, H.hr, s, H.hr, H.tail]
s = '\n'.join(pL)

fn = fn.split('.')[0] + '.html'
FH = open(fn,'w')
FH.write(s + '\n')
FH.close()

Here's a screenshot of html_tags.py:

Pretty code (3)

I worked a bit more on the Python parser that I described last time (here). Actually, I started over from scratch. The first approach was based on breaking the code into words for processing, but I realized that a "stream" approach of one character at a time is better. The only limitation I'm aware of is that I still don't handle triple-quoted strings, but I think it's doable in the future.

As I mentioned before, I'm using <br /> tags in the code now rather than newlines.

The instructions for doing this said to modify the blog template, but I'm afraid of breaking the formatting for old posts. I put the following just below the <body> tag instead, as shown. It seems to work.

<body>

<style>
  cd { font-size:120%; }
  cm { color: green }
  kw { color: blue; }
  str { color: red; }
</style>

Then, to be totally meta about it, I ran the parser on itself. It is set up to grab HTML tags from a separate file (otherwise the meta application chokes on the tags). They are the ones shown above. Plus, there is a head and a tail to make an independent .html document. But for pasting into the blog, you don't need those guys. I invoked it like this:

python simple_parser2.py simple_parser2.py > example.html

And here is the code:


import sys
from keyword import iskeyword
from utils import load_data
import html_tags as H

try:  
    fn = sys.argv[1]
except IndexError:  
    fn = 'example.py'
data = list(load_data(fn))

D = {'is_cm':False,
     'is_str_1':False,'is_str_2':False }

L = list()
for c in data:
    # comments first
    if c == '#':
        if not (D['is_str_1'] or D['is_str_2']):
            L.extend(list(H.cm_start))
            D['is_cm'] = True
    if c == "\n" and D['is_cm']:
        L.extend(list(H.cm_stop))
        D['is_cm'] = False
    L.append(c)
    
    # single-quoted strings
    if c == "'":
        if not D['is_str_1']:
            if not D['is_str_2']:
                # start a str_1
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['is_str_1'] = True
            else:
                # already in str_2
                pass
        else:
            # terminate str_1    
            L.extend(list(H.str_stop))
            D['is_str_1'] = False
    # double-quoted strings
    if c == '"':
        if not D['is_str_2']:
            if not D['is_str_1']:
                # start a str_2
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['is_str_2'] = True
            else:
                # already in str_1
                pass
        else:
            # terminate str_2    
            L.extend(list(H.str_stop))
            D['is_str_2'] = False
s = ''.join(L)

# keywords last
pL = list()
for line in s.split('\n'):
    D['is_cm'] = False
    words = line.split()
    for w in words:
        # no kw highlighting in comments
        if w.startswith(H.cm_start):
            D['is_cm'] = True
        if not D['is_cm'] and iskeyword(w):
            r = H.kw_start + w + H.kw_stop
            line = line.replace(w, r)
    pL.append(line)
s = H.br.join(pL)

pL = [H.head, H.hr, s, H.hr, H.tail]
s = '\n'.join(pL)

try:  
    fn = sys.argv[2]
except IndexError:  
    fn = 'example.html'
FH = open(fn,'w')
FH.write(s + '\n')
FH.close()

Sunday, August 21, 2011

Pretty code (2)

So the question is, what are the good ways to format code for blog posts, when you don't control the machine that serves the pages (like on Blogger). The first---and motivating---post in this series is here.

I always prefer simple to complex. I even prefer simple html pages, although I'll ignore the obvious problem that pages are not simple any more when Blogger gets through with them.

So I liked the idea of controlling the color of text in a page with self-defined tags. For example, we can put this in the head of our html document:

It's a seductively simple idea. But.. it requires a way of parsing Python code to generate the appropriate tags. I thought if I enforced some rules that could be done relatively easily.

After a few hours, I have something that sort of works, but it has too many bugs to be useful yet.

The main simplifying rule is that multi-line strings are not allowed. Also, at this point I assume that indentation is a four-space tab.

You can see the results in the graphic (above). I put the script on Dropbox (simple_parser.py). Without a lot more debugging, this seems like it's probably not the best option, but it was an entertaining exercise for the afternoon.

Formatting problem solved

[ UPDATE: So I discovered that the line breaks setting below seems to be retroactive, and changing it messes up the old posts. Another solution is to substitute newlines in the code with <br /> tags. That's what I've done here, and seems OK. This makes it difficult to actually edit code samples in the editor, but them's the breaks. :) ]

On the topic of how to format code samples for the blog, I've been doing some testing and found that the combination of



Settings > Formatting

section:  Convert line breaks

No

and



Settings > Basic

section:  Global > Select post editor

Updated editor

appears to have solved the problems that cropped up (my post here) with my standard approach using <table>. A sample:

def f(n):
print "Hello world!"

And I like it that the new editor's preview mode gives an accurate view of the product.

However, there are other solutions to consider, including how to get syntax highlighting without needing to load resources that are out of my control (my SO question is here), or whether to use a code hosting service and "embed" the code here. I'll have to get back to you on that.

Saturday, August 20, 2011

Bioonductor: summary

[ Update: Sorry for the dup. I tried using the new formatting on this one, but found a problem, which I hope is fixed now.]

I promise this is the last one. Here we load the sample data from ExpressionSet. I show (once again) how to obtain all the relevant stuff: expression data, sample names, feature names, and gene names.

One of the nice things about ExpressionSet objects is the very fancy indexing one can do with them. A great example is given on p. 166 of the Bioconductor book.

I made an stab at implementing range selectors here but it's almost trivial. Also, in this one, the Python code does not contain any informative whitespace, so I've used the <code> tag.
The script is first and then the output. If you can't figure something out, just ask.

script.py

import rpy2.robjects as robjects 
from rpy2.robjects.packages import importr  
import rpy2.rinterface as ri  
import bioc.biobase as biobase 

r = robjects.r 
g = robjects.globalenv 
#gd = importr('grDevices') 
dim = r['dim'] 

r('data("sample.ExpressionSet")') 
sample_ExpressionSet = r['sample.ExpressionSet'] 
eset = sample_ExpressionSet 
#print tuple(ri.globalenv) 
print eset 

print dim(eset) 
ed = eset.exprs 
print ed.rx(1,1) 

# no way to do es[x,y] from Python? 
# just wrap it 
r(''' 
do_sel <-function(obj,sel,bycol=FALSE) {
if (bycol) { return(obj[,sel]) } 
else { return(obj[sel,]) } 
} 
''') 

do_sel = r['do_sel'] 
# both work: 
v = robjects.IntVector([2,3,6]) 
v = robjects.BoolVector([True,False]*13) 
eset_sub = do_sel(eset,v,bycol=True) 
eset_sub = do_sel(eset_sub,v) 
print dim(eset_sub) 

# no explicit featureData, just from data rows 
rownames = r['rownames'] 
fd = rownames(ed) 
print fd[:2] 

# get the gene names 
importr('annaffy') 
r(''' 
get.genenames <- function(probeids) {
symbols <- aafSymbol(probeids, 'hgu95av2') 
getText(symbols) } 
''') 
get_genenames = r['get.genenames'] 
gn = get_genenames(fd) 

# slightly fancy, no dups, original order 
from collections import OrderedDict 
z = zip(gn,[None]*len(gn)) 
D = OrderedDict(z) 
L = D.keys() 
for i in range(4):  print L[i*5:i*5+5] 
print 

pd = eset.do_slot('phenoData') 
data = pd.do_slot('data') 
v = robjects.IntVector(range(1,4)) 
print do_sel(data,v)

> python script.py 
ExpressionSet (storageMode: lockedEnvironment) 
assayData: 500 features, 26 samples  
element names: exprs, se.exprs  
protocolData: none 
phenoData 
sampleNames: A B ... Z (26 total) 
varLabels: sex type score 
varMetadata: labelDescription 
featureData: none 
experimentData: use 'experimentData(object)' 
Annotation: hgu95av2  

Features  Samples  
500       26  

[1] 192.742 

Features  Samples  
3        3  

[1] "AFFX-MurIL2_at"  "AFFX-MurIL10_at" 


******* Deprecation warning *******: 

The package 'hgu95av2' is deprecated and will not be supported in the future.  

Instead we strongly reccomend that you should start using the 'hgu95av2.db' package.  

We hope you will enjoy these new packages.  If you still have questions, you can search  
(http://dir.gmane.org/gmane.science.biology.informatics.conductor) or ask the mailing list  
(http://bioconductor.org/docs/mailList.html).   


Loading required package: hgu95av2 
['', 'STAT1', 'GAPDH', 'ACTB', 'TFRC'] 
['C15orf31', 'ACTL6B', 'GLRA1', 'KCNB2', 'MGAT5'] 
['BMP3', 'RPL14', 'ZCWPW2', 'KITLG', 'GPR12'] 
['TRA@', 'ATP8A2', 'C1orf68', 'FAM20B', 'SLC34A1'] 

sex    type score 
A Female Control  0.75 
B   Male    Case  0.40 
C   Male Control  0.73 

>

Friday, August 19, 2011

Bioconductor: multiple plots

Actually, I have a bit more to do with Bioconductor ExpressionSets, as long as I'm thinking about it. When I first played with the ALL example, I tried other pairs besides ('ALL1/AF4','BCR/ABL')---I'd have to look now, but they are probably in the original paper too. Anyway, in the first talk I gave about this stuff, I made plots of all the pairs. Some are really nice. So that's what I'd like to show now.

The difference is that then, I wrote the code in R, and I did not like the experience.

In Python, we do the standard imports. We factor out the R code into several functions, then load it from a text file in Python, and execute the code in R-space. In Python, we pick pairs to compare and call a plotting function in R.

The R code is a little hackish (I'm thinking of color.map), but this is a good start.

R code:

color.map <- function(g) {
colors = c('red','green','blue',
'cyan','magenta','maroon')
if (g == 'ALL1/AF4') i=1
if (g == 'BCR/ABL') i=2
if (g == 'E2A/PBX1') i=3
if (g == 'NEG') i=4
if (g == 'NUP-98') i=5
if (g == 'p15/p16') i=6
colors[i]
}

library('limma')
library('annaffy')

get.genenames <- function(eset) {
probeids <- featureNames(eset) 
symbols <- aafSymbol(probeids, 'hgu95av2') 
getText(symbols)
}

compare.factors <- function(eset,L) {
eset.sub = eset[,eset$mol.biol %in% L]
f <- factor(as.character(eset.sub$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset.sub,design))
pv = fit$p.value[,2]
sel <- p.adjust(pv) < 0.001
eset.sel <- eset.sub[sel,]
}

plot.eset <-function(eset,...) {
patient.colors <- unlist(lapply(
eset$mol.biol,color.map))
genenames = get.genenames(eset)
heatmap(exprs(eset),
col=topo.colors(100),
ColSideColors=patient.colors,
labRow=genenames)
}

The R code is in a file Rcode.txt. Python code:

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr 
import bioc.biobase as biobase

r = robjects.r
g = robjects.globalenv
gd = importr('grDevices')

importr('ALL')
r('data("ALL")')
ALL = g['ALL']
eset = biobase.ExpressionSet(ALL)
#--------------------------------------------

FH = open('Rcode.txt','r')
s = FH.read()
FH.close()
r(s)
compare_factors = r['compare.factors']
plot_eset = r['plot.eset']
#--------------------------------------------
targets = [ ['ALL1/AF4','NEG'],
['ALL1/AF4','BCR/ABL'],
['E2A/PBX1','BCR/ABL'] ]

def get_ofn(t):
#print t, len(t)
u,v = t
s = u + '_' + v
ofn = s.replace('/','-') + '.pdf'
return ofn

def plot(eset_sub,t):
ofn = get_ofn(t)
gd.pdf(ofn)
plot_eset(eset_sub,ofn)
gd.dev_off()

for t in targets:
eset_sub = compare_factors(eset,t)
plot(eset_sub,t)

The first plot is at the top, and the second and third are below. It's pretty clear that there are significant commonalities according to mol.biol. BTW, I adjusted the target p-value to keep the number of hits down.

I also added code to substitute the gene names in the plots. That's pretty cool.

There are problems with the alternatives. That's for another time.