Python for Bioinformatics: Python bytes

Sunday, November 15, 2009

About ten days ago I posted about looking at the passwords stored in /etc/xgrid/agent/controller-password and similar files.

The code I put up was rather silly: it used a hard-coded dict to translate bytes (in their string representation) to hexadecimal. This is not worth spending too much time on, especially since Python 3 has a whole 'nother attitude about strings and bytes, but I thought I would at least show a simpler and (I hope) more correct Python 2 approach to the issue.

So, of course, we have bits and bytes on the machine, and strings exist only on screen or on paper. We can represent bits and bytes as integers or as chars, and vice versa. I'm sure everyone knows we can go from int to chr and back again:

>>> chr(78)
'N'
>>> ord('P')
80
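
The same round trip works for a whole string. Here is a minimal sketch (the variable names are mine, and it is written so that it should run under Python 2.6+ or Python 3):

```python
# Convert each character of a short string to its integer code and back.
text = 'Python'
codes = [ord(c) for c in text]               # characters -> ints
restored = ''.join(chr(i) for i in codes)    # ints -> characters

print(codes)
print(restored == text)
```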

My understanding is that we should view integers as the natural intermediate form for conversion of bits and bytes from base 2 to other bases.

>>> bin(15)
'0b1111'

In this representation the binary number 1111 is an int (15) or its string representation ('0b1111'). Python also has string reps for hexadecimal and octal:

>>> hex(15)
'0xf'
>>> oct(15)
'017'

We can go from binary or hex back to int, but we need to specify the base:

>>> int('0b1111',2)
15
>>> int('0xf',16)
15

We don't actually need the leading '0b', '0x' or '0':

>>> int('f',16)
15
>>> int('17',8)
15

So, the other day I should have just done:

>>> bin(int('0xf',16))
'0b1111'
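
That hex-to-int-to-binary route can be wrapped in a small helper. This is only a sketch: hex_to_bits is a hypothetical name of my own, it assumes the input has no leading '0x', and padding to 4 bits per hex digit is my choice rather than anything bin() does for you.

```python
def hex_to_bits(hex_string):
    """Return the bits of a hex string (no '0x' prefix), 4 bits per digit."""
    value = int(hex_string, 16)               # hex text -> int
    width = 4 * len(hex_string)               # each hex digit holds 4 bits
    return format(value, 'b').zfill(width)    # int -> zero-padded binary text

print(hex_to_bits('f'))
print(hex_to_bits('6e'))
```

Going through format(value, 'b') instead of bin() simply avoids having to strip the '0b' prefix afterwards.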

When reading data from a file:

FH = open('script.py','rb')
data = FH.read(8)
FH.close()

print type(data)
print len(data)
print data

<type 'str'>
8
from bin

Although the file was opened in "binary" mode, the type actually read is <type 'str'>, and when the data are printed they look like an ordinary string. Nevertheless, the data respond well to b2a_hex, a binascii function that operates on binary data and converts it to a hexadecimal string representation.

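from binascii import b2a_hex
# one byte at a time: each byte becomes a 2-character hex string
L = [b2a_hex(b) for b in data]
print L
# hex string -> int, then int -> char
L = [int(h,16) for h in L]
print L
print [chr(i) for i in L]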

['66', '72', '6f', '6d', '20', '62', '69', '6e']
[102, 114, 111, 109, 32, 98, 105, 110]
['f', 'r', 'o', 'm', ' ', 'b', 'i', 'n']
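
For comparison, the same two lists can be produced without binascii. This is a sketch under my own assumptions: the literal b'from bin' stands in for the 8 bytes read from the file, and it relies on bytearray (available from Python 2.6) yielding ints when iterated.

```python
data = b'from bin'    # stand-in for the 8 bytes read from the file

# Iterating over a bytearray gives ints, in Python 2.6+ and Python 3 alike.
ints = list(bytearray(data))
# Format each int as two lowercase hex digits.
hexes = ['%02x' % i for i in ints]

print(hexes)
print(ints)
```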

The result is rather different if we use the same function on the data as a whole:

L = b2a_hex(data)
print len(L)
print L
# walk the hex string two characters (one byte) at a time
for i in range(0,len(L),2):
    h = L[i:i+2]
    print h,
    print chr(int(h,16))

16
66726f6d2062696e
66 f
72 r
6f o
6d m
20 
62 b
69 i
6e n

In this case, the 8 bytes are converted to a single string of 16 hexadecimal characters, so to convert back to ints and chars we must read the hexadecimal in two-character chunks.
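
If all we want is the original bytes back, binascii can do the pairing for us. A minimal sketch, again using the literal b'from bin' as my stand-in for the file data:

```python
from binascii import b2a_hex, a2b_hex

data = b'from bin'              # stand-in for the 8 bytes read from the file
hex_text = b2a_hex(data)        # 8 bytes -> 16 hex characters
restored = a2b_hex(hex_text)    # consumes the hex two characters per byte

print(len(hex_text))
print(restored == data)
```

a2b_hex is the inverse of b2a_hex, so the manual slicing is only needed when we want to look at each byte along the way.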

Does that make sense?