Python for Bioinformatics: Dissecting RSA keys in Python (3)

I'm working on understanding the structure of RSA keys as output by the ssh-keygen utility (previous posts here and here). At the risk of becoming redundant, I want to try to simplify what we did previously. Let's begin by rehashing old material on bytes in Python:

It's easy to get confused between a byte and its string representation, or for that matter, between an unsigned int and its string representation. Although hexadecimal is natural too, it's perhaps easiest to think of an individual byte in terms of the decimal equivalent (0..255). That's because a Python int converts easily to chr, bin, and hex.

>>> i = 35
>>> c = chr(i)
>>> c
'#'
>>> i == ord(c)
True
>>> b = bin(i)
>>> b
'0b100011'
>>> h = hex(i)
>>> h
'0x23'

Notice that hex and bin both give unpadded output:

>>> hex(1)
'0x1'
>>> bin(1)
'0b1'

Don't be fooled. h and b here are 'str' datatypes. But these string representations of bin and hex values go back to int easily (though we must specify the old base explicitly):

>>> int(b,2)
35
>>> int(h,16)
35

To interconvert hex and bin, it's easiest to go through int:

>>> bin(35)
'0b100011'
>>> hex(35)
'0x23'
>>> hex(int(bin(35),2))
'0x23'
>>> bin(int(hex(35),16))
'0b100011'

The struct module

This module performs conversions between Python values and C structs represented as Python strings. This can be used in handling binary data stored in files or from network connections, among other sources. It uses Format Strings as compact descriptions of the layout of the C structs and the intended conversion to/from Python values.

unpack works on the particular string representation of a byte in which it is a single character:

>>> i = 35
>>> c = chr(i)
>>> c
'#'
>>> unpack('B',c)
(35,)
>>> b = bin(i)
>>> unpack('B',b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 1
>>>

To construct a multi-byte value like a 4-byte unsigned int we can do this:

>>> h = '\x23'
>>> type(h)
<type 'str'>
>>> b_little = '\x23\x00\x00\x00'
>>> type(b_little)
<type 'str'>

With multi-byte objects (like ints) we get into endian-ness---'I' is the format for an unsigned int, and '>' means big-endian:

>>> unpack('I', b_little)
(35,)
>>> b_big = '\x00\x00\x00\x23'
>>> unpack('I', b_big)
(587202560,)
>>> unpack('>I', b_big)
(35,)

One can also use the binascii module, but this is more complicated. Let's write two bytes to a file and then read the binary data:

>>> fn = 'temp'
>>> FH = open(fn,'w')
>>> FH.write('##')
>>> FH.close()
>>> FH = open(fn,'rb')
>>> data = FH.read()
>>> FH.close()

This "data" is still a string:

>>> type(data)
<type 'str'>
>>> data
'##'
>>> len(data)
2
>>> b2a_hex(data[0])
'23'
>>> b2a_hex(data)
'2323'
>>> b2a_hex('#Az')
'23417a'

Or you can get fancy, but it seems unnecessary..

>>> import array
>>> L = array.array('B','#Az')
>>> type(L)
<type 'array.array'>
>>> L[0]
35
>>> L[1]
65
>>> L
array('B', [35, 65, 122])
>>> b2a_hex(L)
'23417a'

With all this in mind, we can redo the script from yesterday to read the base64-encoded key data in a way that is simply understandable. The output first:

> python read.py 
dlen 7
dlen 1
dlen 129

209,8,39,27 .. 169,97,145,127
0xd1,0x8,0x27,0x1b .. 0xa9,0x61,0x91,0x7f 

1467871546 .. 1722964351

We decode the data and break it into 3 parts exactly as before. Looking at the third part, we unpack it byte by byte. The result of unpacking is a list of ints. We convert that list directly into the large number n (by doing b * 256**i for each byte, moving along the list in reverse order), or we can convert the ints into hex values, assemble that list into one long hex string, and then call eval as before. It seems perfectly transparent now.

It even seems clear (in retrospect) why there is an additional null byte after the int that specifies the size of the third segment. It is probably to align the base64 encoding to begin with the first data byte.

The only remaining difficulty is that this approach fails for the private key. So that's still a mystery.

read.py

import base64
from struct import unpack

FH = open('data.txt','r')
data = FH.read()
FH.close()
b64_data = ''.join(data.strip().split())
data = base64.b64decode(b64_data)

L = list()
while data:
    dlen = unpack('>I',data[:4])[0]
    L.append(data[4:dlen+4])
    data = data[dlen+4:]
    print 'dlen', dlen
print

# let's look at n
bL = L[2]
bL = bL[1:]    # extra null first, why?
iL = [unpack('B',b)[0] for b in bL]
hL = [hex(i) for i in iL]

# look at the int and hex values
pL = [str(n) for n in iL]
print ','.join(pL[:4]), '..', ','.join(pL[-4:])
print ','.join(hL[:4]), '..',
print ','.join(hL[-4:]), '\n'

# first just do the computation ourselves
m = 256
iL.reverse()
n = iL[0]
for i in iL[1:]:
    n += i*m
    m *= 256
    
s = str(n)
print s[:10], '..', s[-10:]
    
# hex doesn't pad the output
# remove the '0x' and add the extra '0' if needed
for i in range(len(hL)):
    b = hL[i]
    if len(b) < 4:
        hL[i] = '0' + b[-1]
    else:
        hL[i] = b[-2:]  
h = '0x' + ''.join(hL)
nh = eval(h)
assert n == nh

Wednesday, August 31, 2011

Dissecting RSA keys in Python (3)