Python for Bioinformatics: Hex puzzle (3)

I've been working on the hex puzzle (previous posts here and here). We have a list of 340 unqiue lines of data, each 100 characters in length. The first and last lines have been given to us as i = 137 and i = 61, respectively.

The data file ('reddit_hex.txt') is on Dropbox here.

The data seem to be hexadecimal bytes with an unusual distribution that leads to the first part of the decoding: these are bytes that represent ASCII characters in the set A..Za..z0..1. This suggests the possibility that the given data are a hex representation of base64 (although the last two characters '/' and '+' were not found).

Unfortunately, I haven't yet been able to stitch the lines together in a way that generates a 1936 hex digit mesage, with the correct ends, to be decoded. Apparently I go wrong at some points in the assembly, probably because of repeats in the data stream, although perhaps I just messed up.

It might be a disappointment to you, but I'm going to try using another critical clue gleaned from the comments to the reddit post. After converting the base64 data to actual ASCII characters, it turns out that there are two intertwined messages: the alternating bytes form a C program and a PHP program. So I wrote a script that uses all these clues to analyze each of the lines of the original data.

Let's look at some sample output, say the first entry (i = 137):

> python demo.py 137
137 497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369
    I z x p P 2 5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i
    IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci
  0 ('#include <stdio.h>', '<?php $a = array(9')

What's going on here is that we have 100 hexadecimal digits or 50 bytes in the original data. Each byte can be converted to its ASCII representation as shown in the second and third lines.

>>> chr(int('0x49',base=16))
'I'

We can do the whole line at once with:

>>> import binascii
>>> s = '497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369'
>>> binascii.unhexlify(s)
'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci'

The base64 data decoding takes 4 characters at a time. For example:

>>> import base64
>>> d = base64.b64decode
>>> d('Izxp')
'#<i'
>>> s = 'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci'
>>> d(s[:-2])
'#<i?npchlpu d$ea  <=s tadriroa.yh(>9'

Because base64 is a 4-letter code, there are 4 possible frames, and we only know the frame for the first line of data. It turns out that for this data, for each line, only one of the frames yields just ASCII characters. In the script below, we filter out decodings like this one:

'\xd0\xb0\x80\xc8\xb1\x80\xc0\xb0\x80\xd8\xb0\x80\xc4\xb1\x80\xd8\xb0\x81'

by turning the whole line into an array of ints and then asking whether any is > 128.

def filter(line):
    B = [struct.unpack('B',b)[0] for b in line]
    return not max(B) > 128

After conversion from base64 to ASCII, we build strings from the alternate bytes, resulting in the output line:

0 ('#include <stdio.h>', '<?php $a = array(9')

This indicates that the two strings were derived from frame 0. It's obvious that the first one is the beginning of a C program, and the second certainly looks like PHP to me.

Note that we can't just use the data as is, because we have thrown away bytes at the joints. Decoding in frame 0, we lose the 'Ci' at the end of the base64 data.

Also, consider i = 8:

> python demo.py 8
  8 357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369787049473432
    5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i x p I G 4 2
  8 5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42
  2 ('clude <stdio.h>\nin', 'hp $a = array(9, 6')

Notice that 'clude <stdio.h>\nin' looks like a continuation of '#include <stdio.h>' from the first line.

>>> import base64
>>> d = base64.b64decode
>>> d('Y2hscHUg')
'chlpu '

The 'cl' of 'clude' are coming from the base64 'Y2hs'. It's reassuring that we can align 0 and 8:

IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci
      5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42

So the plan is, to stitch data lines together one at a time, and examine the result to be sure it makes sense after going through base64 decoding and de-interleaving.

demo.py

import sys, binascii, base64, struct

def load_data(fn):
    FH = open(fn,'r')
    data = FH.read().strip()
    FH.close()
    return data.strip().split('\n')

def untangle(s):
    s1 = ''.join([s[i] for i in range(0,len(s),2)])
    s2 = ''.join([s[i] for i in range(1,len(s),2)])
    return s1,s2

def filter(line):
    B = [struct.unpack('B',b)[0] for b in line]
    return not max(B) > 128

def process(s):
    d = base64.b64decode
    rL = list()
    for i,j in [(0,-2),(1,-1),(2,len(s)),(3,-3)]:
        line = d(s[i:j])
        if filter(line):
            rL.append((i,untangle(line)))
    return rL
    
def analyze_line(i,L):
    D = {'i':i}
    s = L[i]
    D['hex'] = s
    s = binascii.unhexlify(s)
    D['b64'] = s
    result = process(s)
    assert len(result) == 1
    j, t = result[0]
    D['frame'] = j
    D['t'] = t
    return D
    
def format_dict(D):
    s1 = '%3d ' % D['i'] + D['hex']
    s2 = '    ' + ' '.join(list(D['b64'])) 
    s3 = '%3d ' % D['i'] + D['b64']   
    s4 = '%3d ' % D['frame'] + str(D['t'])
    return '\n'.join([s1,s2,s3,s4])

L = load_data('reddit_hex.txt')
DL = [analyze_line(i,L) for i in range(len(L))]

try:
    i = int(sys.argv[1])
    text = format_dict(DL[i])
    print text, '\n'
except IndexError:
    for D in DL:
        text = format_dict(D)
        print text, '\n'

Sunday, April 15, 2012

Hex puzzle (3)