Python for Bioinformatics: Hex puzzle (1)

Monday, April 9, 2012

Hex puzzle (1)

Last Saturday, an interesting puzzle was posted on reddit/r/programming. I'm having trouble finding the original post, but the actual puzzle page is here, and the (complete) comments to the post are here.

As you can see from the screenshot or the link

the puzzle is a message consisting of 1936 hex digits, and upon further manipulation it promises to yield the answer. You can actually follow the link to see the answer if you're impatient. I wasn't able to compete (I had a party to cook for, and could not have gone as fast as these guys anyway), but I'd still like to understand how they got there.

Apparently, we need to assemble a number of overlapping 100 digit pieces of a 1936 digit message and then figure out what it means. The pieces were served up to different visitors to the site and collected as comments on reddit. The first objective is to recover the observed data. I believe reddit must have an API for this, but I don't know it. What I did was to save the source for the page with all 410 comments (in the file comments.txt) and then run scrape.py below.

I count 251 strings of 100 characters, all in "0..9a..f".

According to the instructions, the first and last strings are marked, and these were recovered and posted on reddit:

497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369 //first 100 4270494630674f3349675a5831304948566c636d4e75614342764d4341374a79426c494363674f7941674944393950676f3d //last 100

[UPDATE: There is a link to the complete (known) data from reddit to github here. I count a total of 340 strings, substantially more than what I've been working with in early attempts. This should help with the assembly problem. ]

The data clearly has substantial structure, as indicated by the screenshot from a "find" search for the digit "4" in TextEdit.

Since only hexadecimal digits are present it suggests that this might actually be hexadecimal. It's interesting that not all digits are equally represented. In particular, only those bytes (two hex digits) which encode ASCII "a..zA..Z0..9" are present (although the distribution is decidedly odd). So it seems pretty likely that we just need to decode the hex to ASCII and then figure out what it means.

However, there is an assembly problem that comes first.

Output from count.py (listing at the end):

> python count.py 
  54  30   48  0  0
  40  31   49  1  1
 256  32   50  2  2
  71  33   51  3  3
  58  34   52  4  4
  82  35   53  5  5
  64  37   55  7  7
  39  38   56  8  8
  63  39   57  9  9
1070  41   65  A  A
 133  42   66  B  B
1102  43   67  C  C
 414  44   68  D  D
 111  45   69  E  E
  38  46   70  F  F
 169  47   71  G  G
 101  48   72  H  H
 812  49   73  I  I
 151  4a   74  J  J
 132  4b   75  K  K
 667  4c   76  L  L
 399  4d   77  M  M
 532  4e   78  N  N
  83  4f   79  O  O
  82  50   80  P  P
  67  51   81  Q  Q
  65  52   82  R  R
 286  53   83  S  S
  58  54   84  T  T
  51  55   85  U  U
  12  56   86  V  V
  56  57   87  W  W
  31  58   88  X  X
 315  59   89  Y  Y
  92  5a   90  Z  Z
  16  61   97  a  a
  13  62   98  b  b
 278  63   99  c  c
  61  64  100  d  d
  58  65  101  e  e
   8  66  102  f  f
 938  67  103  g  g
  57  68  104  h  h
 247  69  105  i  i
 201  6a  106  j  j
 116  6b  107  k  k
  62  6c  108  l  l
  64  6d  109  m  m
  97  6e  110  n  n
  61  6f  111  o  o
 209  70  112  p  p
 896  73  115  s  s
  25  74  116  t  t
  23  75  117  u  u
  15  76  118  v  v
 532  77  119  w  w
 211  78  120  x  x
 414  79  121  y  y
 192  7a  122  z  z

scrape.py

# python scrape.py > data.txt
fn = 'comments.txt'
FH = open(fn)
data = FH.read()
FH.close()

L = list()
i = 0

while True:
    j = data.find('>',i)
    j += 1
    i=j
    k = j + 100
    if k > len(data):
        break
    if not data[k] == '<':
        continue
    w = data[j:k]
    if ' ' in w or '>' in w or '<' in w:  
        continue
    for c in w:
        assert c in '0123456789abcdef'
    L.append(w)
    if i == 0:
        break

L = list(set(L))
L.sort()

for w in L:
    print w

count.py

import binascii
from collections import Counter

def load_data():
    fn = 'data.txt'
    FH = open(fn,'r')
    data = FH.read().strip()
    FH.close()
    L = data.split('\n')
    return L

L = load_data()
n = 2
cL = list()
for line in L:
    for i in range(0,len(line)-n+1,2):
        cL.append(line[i:i+n])

C = Counter(cL)
for h,n in sorted(C.most_common()):
    i = int(h, base=16)
    print '%4d  %s %4d  %s  %s' %\
    (n, h, i, chr(i), binascii.unhexlify(h))