Python for Bioinformatics: Hex puzzle (2)

Let's continue with the puzzle introduced in the last post (here).

The next step is to stitch the lines together to form the complete text. If they contained random bytes, that would be relatively easy. There are 65536 possible two-byte combinations. We have a total of 340 * 25 = 8500 two-byte (4 hex character) combinations in our data, so the expected number of times any combination should appear in random data is much less than 1.

However, the data is not random, instead, it is highly skewed. Simple modifications to count.py (from last time) show a total of only 370 two-byte combinations in the data (i.e. 65166 of all possible values are not observed), and the top 20 have counts ranging from 464 down to 72.

464  4341  CA
 331  4943  IC
 309  4167  Ag
 281  4c43  LC
 234  734c  sL
 174  4173  As
 154  4377  Cw
 140  7773  ws
 139  7349  sI
 139  674d  gM
 139  674c  gL
 137  4c44  LD
 126  674e  gN
 104  6749  gI
 103  7767  wg
  95  4e69  Ni
  89  5973  Ys
  78  4944  ID
  76  4947  IG
  72  4e6a  Nj

In the table above, the first column is the count, the second and third are the hex and ASCII representations.

That compares with a maximum count of 3-4 for random data, and an observed total of 8000 different values on an average run with random data.

This suggests that we should undertake some data exploration to see whether the repetitive character of the data will cause difficulties for the assembly.

The core operation is to take an individual entry at index i in the data (the query), and depending on a parameter (tlen) for the number of characters to choose at the very end of the query, make a substring sequence t, the target to search for. Then, we go through all the other entries looking for a match containing t, and when one is found we save the index of the query and the match as well as the position k where the target t begins in the match string.

The quality of these matches can be assessed in various ways. We expect that the sequence of each match string upstream of the target should be exactly the same as the query. We calculate the offset o where the beginning of the match string should be found in the query. Since this depends on tlen (o = len(L[i]) - k - n) we calculate it for each hit and save it with the other parameters. The offset is also useful for printing the actual alignments in show.

A second quality test is to examine the match strings downstream of the target. In the region of overlap, we expect they should be identical as well. A failure of either test would suggest that the data is causing trouble for our simple search algorithm.

As mentioned last time, the observed hex digits comprise all (and only) those which encode ASCII characters 0..1A..Za..z. Rather than print out strings of 100 characters, it will simplify the output to convert the data to ASCII first. We're assuming that this is what we need to do with the data, but I peeked at the answer on reddit, so it's safe to go ahead.

python convert.py > data.mod.txt

import binascii
from utils import load_data

L = load_data('reddit.txt')
for line in L:
    print binascii.unhexlify(line)

Note the extremely repetitive nature of the data by comparing some adjacent output lines:

A1NiAsNiAsMyAsMSA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx
A1NywsICAyNSwsICA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx

AgLDUgLDcgLDMgLDAgLDAgLDcgLDIgLDIgLDYgLDYgLDEgNTQs
AgLDUgLDcgLDMgLDAgLDAgLDcgLDYgLDIgLDYgLDYgLDEgNTQs

xpIG42dCwgIG00YSxpIG42KCwpICAzeywgICAyICwgICA2ICxp
xpIG43dCwgIGE3WyxdICA3PSwgIHswNCwsICA2NCwsICAxNiws

However, for a well-chosen query (one with a rare two-byte combination as the target) and a long enough target length, it's easy to find matches that appear correct. In the output below, we examine the string found at index 0 in the data as the query. The analysis finds 25 matches, data strings where the terminal 4 characters appear somewhere in the match, and all of those matches are perfect ones whether looking upstream and comparing against the query, or looking downstream in each one (beyond the target) and comparing with each other, as shown in the second half of the output.

We'll exercise these routines a bit more next time. I want to see how big to make the target size, and also whether we should exclude some strings from the preliminary analysis. But overall, it looks like we might be OK.

> python search.py
i= 0 : 25 matches found
0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                                              GExc

 NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
  iwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
      CAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
       AyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
           xmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
            mICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
              CcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
               cwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                wLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                 LCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                  CwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                   wgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                    gIDc2KSw7ICAxJCxiICA2PSwgIGExc
                              CAxJCxiICA2PSwgIGExc
                               AxJCxiICA2PSwgIGExc
                                xJCxiICA2PSwgIGExc
                                 JCxiICA2PSwgIGExc
                                  CxiICA2PSwgIGExc
                                   xiICA2PSwgIGExc
                                       A2PSwgIGExc
                                        2PSwgIGExc
                                          SwgIGExc
                                            gIGExc
                                             IGExc
                                              GExc
no mismatches upstream

GExc
    j
    jR
    jRyLGE
    jRyLGEg
    jRyLGEgeTYo
    jRyLGEgeTYoL
    jRyLGEgeTYoLDc
    jRyLGEgeTYoLDcg
    jRyLGEgeTYoLDcgL
    jRyLGEgeTYoLDcgLD
    jRyLGEgeTYoLDcgLDQ
    jRyLGEgeTYoLDcgLDQg
    jRyLGEgeTYoLDcgLDQgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDY
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLD
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDA
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDc
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLD
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLDM
no mismatches downstream

search.py

from utils import load_data

# tlen is minimum overlap
# functions deal with one string at a time
# do not assume upstream matches
# e is search string, f is potential match

def one_search(i,L,tlen):
    # search result has associated tlen
    oL = list()
    e = L[i]
    t = e[-tlen:]
    for j,f in enumerate(L):
        if i == j or not t in f:
            continue
        # offset is pos in e where f should start
        # not guaranteed to match except over t
        offset = len(e) - f.index(t) - tlen
        oL.append((j,offset))
    oL.sort(key=lambda item:item[1])
    return oL

# mmL is a list of mismatched strings
# ummL contains upstream mismatches
def test_upstream(i,L,oL,ummL):
    e = L[i]
    for j,o in oL:
        f = L[j]
        if not f.startswith(e[o:]):
            if i < j:
                ummL.append((i,j))
            else:
                ummL.append((j,i))
    
def trim_downstream(oL,L,tlen):
    pL = list()
    for j,o in oL:
        f = L[j]
        k = len(f) - o
        pL.append((j,f[k:]))
    return pL
    
def test_downstream(pL,dmmL):
    mismatches = 0
    for i,s1 in pL:
        for j, s2 in pL:
            if s1 == s2 or len(s1) < len(s2):
                continue
            if not s1.startswith(s2):
                if i < j:
                    dmmL.append((i,j))
                else:
                    dmmL.append((j,i))
    
if __name__ == '__main__':             
    L = load_data('data.mod.txt')
    tlen = 4
    i = 0
    e = L[i]
    oL = one_search(i,L,tlen)
    
    print 'i=', i, ':', len(oL), 'matches found'
    print e
    print ' ' * (len(e)-tlen) + e[-tlen:]
    print
    for j,o in oL:
        k = len(e) - o
        print ' '*o + L[j][:k]
        
    ummL = list()
    test_upstream(i,L,oL,ummL)
    if ummL:
        print len(ummL), 'mismatches'
    else:
        print 'no mismatches upstream\n'

    dmmL = list()
    print e[-tlen:]
    pL = trim_downstream(oL,L,tlen)
    for item in pL:
        print ' '*tlen + item[1]
    test_downstream(pL,dmmL)
    if dmmL:
        print len(dmmL), 'mismatches'
    else:
        print 'no mismatches downstream'

Tuesday, April 10, 2012

Hex puzzle (2)