Python for Bioinformatics: Dissecting RSA keys in Python

I want to look at RSA keys a little bit more in this post. Let's generate a public/private pair (1024 bits) using the SSH utility keygen (no passphrase):

> ssh-keygen -b 1024 -t rsa 
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/telliott_admin/.ssh/id_rsa): id_rsa
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
96:49:5d:40:b1:90:de:99:c5:05:43:f2:92:45:69:67 telliott_admin@c-98-236-78-154.hsd1.wv.comcast.net
The key's randomart image is:
+--[ RSA 1024]----+
|        .o=*B+.  |
|        .o B=.E  |
|       ...==.o   |
|       ..o+.     |
|        S        |
|       .         |
|                 |
|                 |
|                 |
+-----------------+

[ To be honest, at this point, I'm not actually sure this is the randomart that goes with the example in this post.. but I think it could be :) ]

Start with the "public" key. The data in the file looks like this:

id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA0QgnG+LBQosxpbkJIU0eWV9Zf7L63fy3BXp6T9RQH84yISs8SMvFV4J1NYJQtNpGopNFlvzsuFnnxaRaM8ureFIQVa8k/dRzv3r7QyIuoGyNQLDk3K8Tse/55fjuIxJFEf5jvoJs3Z/jYdPHreg8wzNCB/uMkbCccYsVt6lhkX8= my_identifier

Where my_identifier is

telliott_admin@c-myipaddress_with_dashes.hsd1.wv.comcast.net

The common representation of RSA keys (like we see in the file above) uses base64 format. I was confused about this at first, but found a great explanation here, and of course in wikipedia:

The base64, base32, and base16 encodings convert 8 bit bytes to values with 6, 5, or 4 bits of useful data per byte, allowing non-ASCII bytes to be encoded as ASCII characters for transmission over protocols that require plain ASCII, such as SMTP. The base values correspond to the length of the alphabet used in each encoding. There are also URL-safe variations of the original encodings that use slightly different results.

Consider the byte 1001-1110

We can encode this byte using hexadecimal representation as '9e'

>>> b = 0b10011110
>>> b
158
>>> hex(b)
'0x9e'

I didn't actually know you could do this, i.e. using 0b with no quotes (see SO).

The encoding of those 8 bits (one byte) as ASCII characters takes 2 bytes. The '9' and the 'e' take up that much space. An advantage of this representation of the byte is that it's easy to read, also the information can be transmitted to devices that are expecting an ASCII string, where the highest-value bit should be unset.

A disadvantage is that it doubles the size of the string that we're transmitting. As the quote indicates, the other encodings cram more bits of information into the byte to be transmitted. In base32 we could use (RFC4648) the characters 'A'..'Z' plus '2'..'7'.

In base64 we use uppercase, lowercase, digits plus '+' and '/', except that:

Using standard Base64 in URL requires encoding of '+', '/' and '=' characters into special percent-encoded hexadecimal sequences ('+' = '%2B', '/' = '%2F' and '=' = '%3D'), which makes the string unnecessarily longer.
For this reason, a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_',

The wikipedia article has a great graphic showing an encoding in detail. We do conversions easily by using the Python base64 module.

>>> import base64
>>> e = base64.b64encode
>>> d = base64.b64decode
>>> e('Man')
'TWFu'
>>> d('TWFu')
'Man'
>>> s = 'Fourscore, and seven years ago'
>>> len(s)
30
>>> c = e(s)
>>> c
'Rm91cnNjb3JlLCBhbmQgc2V2ZW4geWVhcnMgYWdv'
>>> d(c)
'Fourscore, and seven years ago'

The encoding process looks at each sequence of 24 bits in the input (3 bytes) and encodes those same 24 bits spread over 4 bytes in the output. The last two characters [in the next example], the ==, are padding because the number of bits in the original string was not evenly divisible by 24 in this example.

>>> e('Ma')
'TWE='

We break this two byte (16 bit) value up into 6 + 6 + 6 = 18 bits---that makes three base64 values, and the third one is now E rather than F because the bit pattern is 000100, where the last two 00 are padding. Additional adding is added to bring us back to the byte boundary.

In Lincoln's famous phrase, there are 30 bytes, which is evenly divisible by 6, hence no padding is needed.

The second part of the problem was solved by a question I found on SO

Here is my version of the code:

import base64
import struct

data = open('id_rsa.pub').read().split(None)[1]
print data[:24]
keydata = base64.b64decode(data)
print keydata[:18]

parts = []
while keydata:
    dlen = struct.unpack('>I',keydata[:4])[0]
    data, keydata = keydata[4:dlen+4], keydata[4+dlen:]
    parts.append(data)

def f(x):
    return struct.unpack('B', x)[0]
    
e = eval('0x' + ''.join(['%02X' % f(x) for x in parts[1]]))
print 'e', e

n = eval('0x' + ''.join(['%02X' % f(x) for x in parts[2]]))
print 'n', n

I have to confess I haven't sorted this out completely yet. Nevertheless, it allows us to convert the RSA public key from ssh-keygen to an integer:

Output:

> python script.py
AAAAB3NzaC1yc2EAAAABIwAA
ssh-rsa#
e 35
n 146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351

I want to finish by showing that these values are actually correct, and tie the whole thing back to an earlier discussion of the mathematics of public key cryptography.

I found a module for Python that handles RSA cryptography.

One very nice thing about it is that it's pure Python. It might not be fast, but there's nothing to do to build it. (I used easy_install). It turns out that the rsa module can read the private RSA key that we got earlier (top of the post), but not the public key (which is why I learned everything that's come before this point):

> python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rsa
>>> with open('id_rsa') as f:  data = f.read()
... 
>>> data.split()[4][:50]
'MIICWgIBAAKBgQDRCCcb4sFCizGluQkhTR5ZX1l/svrd/LcFen'
>>>
>>> k = rsa.PrivateKey.load_pkcs1(data)
>>> k.n
146787154640181728129549360865206451974125178122992752499327844555082827713683060312609720092642289502088305950292885968241446393671268243539585900329449529436170858131743078225065287541399805133121852823856839448386268013284889428740476278604926908637093742329548018673369795976229837442944484597101722964351L
>>> k.e
35
>>> k.d
71296617968088267948638260991671705244575086516882194071102095926754516318074629294696149759283397758157177175856544613145845391211758861147798865874304045384903321200809536161346504709634159813307741235395231601606901597492591281205810328850837168587984349087962868939575631849929827946396907884195208774355L
>>> k.p
12214944519618901215070692893221534580187337690484280451921546594832793966376942655508370083135703324858458340368238021755353370324360760471666368429071147L
>>> k.q
12017013618393528112233953119917677248859562342149056793052943111068006176296374093622600767249370271694572460567324457033873003984957249202580919628769533L
>>> k.n == k.p * k.q
True
>>>

You can see that n equals p * q and that it matches what we eventually obtained from the public key file (id_rsa.pub) above. That's pretty cool.

Still to do: use the keys to do encryption and decryption and check that the math is right. Figure out how the hash/digest is done. And figure out how the passphrase is used to modify the private key file. We can even add a passphrase at this point to the old version:

ssh-keygen -f id_rsa -p

I used 'abcde' as a passphrase (weak!):

> cat id_rsa
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,FB4D3BD371510C4B37552EF6E949CC48

E+KtbKVAwE4Dxj3vKYgfHpbgqF9Nk6O4vJ+C6wogXEHXr8vkM9SF5pz6lKNfGRbC
S93/lurvq+1+rCuldDNmEjdy36cbgrWJ1CIwB6fq5VyT3+v+AxDMqV9nQkvHjnxR
/ON2n+CjOSklnvmaBk3l8+IvCIKar2hGS/1qHqJRKJ3B1fgd5I8i9yH6hx8yl8Qm
AfVPgBJ4+91dCW2Y8odtXENgjstGtWZkYd0O+EqPuSuPHLu+J7M2KSQNe8aKOhGf
DXCEwR4aGW8EXfRNQpULbi5JruYebYlg3JILm2MaFwy+OW39q1VElvBc5wD01awW
BF7Y9x++bi9krX/G49NzkLdSVshkDMSP+eRZ4sncWvDCrkN2xlWqnNfQ06I+XCVW
FMUmyMFI87eNmjcZOgIi1S+fQtWcbl4o9Hx1z6xcMd7WxOnfIxe5zHIu8C24WONW
spcVxSsHAA57jYa8j292prtK6HDDa6UOAa32hBz1lvC8fMWPxwluTJ8eyeeK/7O6
hNI5G9gM4MsMEMau8O49fEm0RuiT/aB8IXZeEVQNbV/FdixyBBhY53DMnFQ5wau8
OJCpL51S5lj2o8eAJc61SHwX/Fc9uF+rmAFy08tthLuCF2wGdZWsE7IJ8NQzFJA+
Us2dTCjY57V/QoCC41rWh+YP+srcEK1Wp7DfvXcq98N1/+tECyCe/aDdPCIM/8QD
glKrx8Ah7Tsx6of3vCHiYG0q8Yu68NNJOiWrnGMEmP0Gq9yhLYqKrc8a3oIP0Lst
Ss+0p989df73GdM3nyqtLUwAzaCJQnDivLFT6gltEnM=
-----END RSA PRIVATE KEY-----

Monday, August 29, 2011

Dissecting RSA keys in Python