The next step is to stitch the lines together to form the complete text. If they contained random bytes, that would be relatively easy. There are 65536 possible two-byte combinations. We have a total of 340 * 25 = 8500 two-byte (4 hex character) combinations in our data, so the expected number of times any combination should appear in random data is much less than 1.
However, the data is not random, instead, it is highly skewed. Simple modifications to count.py (from last time) show a total of only 370 two-byte combinations in the data (i.e. 65166 of all possible values are not observed), and the top 20 have counts ranging from 464 down to 72.
464 4341 CA 331 4943 IC 309 4167 Ag 281 4c43 LC 234 734c sL 174 4173 As 154 4377 Cw 140 7773 ws 139 7349 sI 139 674d gM 139 674c gL 137 4c44 LD 126 674e gN 104 6749 gI 103 7767 wg 95 4e69 Ni 89 5973 Ys 78 4944 ID 76 4947 IG 72 4e6a Nj |
In the table above, the first column is the count, the second and third are the hex and ASCII representations.
That compares with a maximum count of 3-4 for random data, and an observed total of 8000 different values on an average run with random data.
This suggests that we should undertake some data exploration to see whether the repetitive character of the data will cause difficulties for the assembly.
The core operation is to take an individual entry at index i in the data (the query), and depending on a parameter (tlen) for the number of characters to choose at the very end of the query, make a substring sequence t, the target to search for. Then, we go through all the other entries looking for a match containing t, and when one is found we save the index of the query and the match as well as the position k where the target t begins in the match string.
The quality of these matches can be assessed in various ways. We expect that the sequence of each match string upstream of the target should be exactly the same as the query. We calculate the offset o where the beginning of the match string should be found in the query. Since this depends on tlen (o = len(L[i]) - k - n) we calculate it for each hit and save it with the other parameters. The offset is also useful for printing the actual alignments in show.
A second quality test is to examine the match strings downstream of the target. In the region of overlap, we expect they should be identical as well. A failure of either test would suggest that the data is causing trouble for our simple search algorithm.
As mentioned last time, the observed hex digits comprise all (and only) those which encode ASCII characters 0..1A..Za..z. Rather than print out strings of 100 characters, it will simplify the output to convert the data to ASCII first. We're assuming that this is what we need to do with the data, but I peeked at the answer on reddit, so it's safe to go ahead.
python convert.py > data.mod.txt
import binascii from utils import load_data L = load_data('reddit.txt') for line in L: print binascii.unhexlify(line) |
Note the extremely repetitive nature of the data by comparing some adjacent output lines:
A1NiAsNiAsMyAsMSA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx A1NywsICAyNSwsICA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx AgLDUgLDcgLDMgLDAgLDAgLDcgLDIgLDIgLDYgLDYgLDEgNTQs AgLDUgLDcgLDMgLDAgLDAgLDcgLDYgLDIgLDYgLDYgLDEgNTQs xpIG42dCwgIG00YSxpIG42KCwpICAzeywgICAyICwgICA2ICxp xpIG43dCwgIGE3WyxdICA3PSwgIHswNCwsICA2NCwsICAxNiws |
However, for a well-chosen query (one with a rare two-byte combination as the target) and a long enough target length, it's easy to find matches that appear correct. In the output below, we examine the string found at index 0 in the data as the query. The analysis finds 25 matches, data strings where the terminal 4 characters appear somewhere in the match, and all of those matches are perfect ones whether looking upstream and comparing against the query, or looking downstream in each one (beyond the target) and comparing with each other, as shown in the second half of the output.
We'll exercise these routines a bit more next time. I want to see how big to make the target size, and also whether we should exclude some strings from the preliminary analysis. But overall, it looks like we might be OK.
> python search.py i= 0 : 25 matches found 0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc GExc NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc iwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc AyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc xmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc mICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc cwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc wLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc LCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CwgIDc2KSw7ICAxJCxiICA2PSwgIGExc wgIDc2KSw7ICAxJCxiICA2PSwgIGExc gIDc2KSw7ICAxJCxiICA2PSwgIGExc CAxJCxiICA2PSwgIGExc AxJCxiICA2PSwgIGExc xJCxiICA2PSwgIGExc JCxiICA2PSwgIGExc CxiICA2PSwgIGExc xiICA2PSwgIGExc A2PSwgIGExc 2PSwgIGExc SwgIGExc gIGExc IGExc GExc no mismatches upstream GExc j jR jRyLGE jRyLGEg jRyLGEgeTYo jRyLGEgeTYoL jRyLGEgeTYoLDc jRyLGEgeTYoLDcg jRyLGEgeTYoLDcgL jRyLGEgeTYoLDcgLD jRyLGEgeTYoLDcgLDQ jRyLGEgeTYoLDcgLDQg jRyLGEgeTYoLDcgLDQgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDY jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLD jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDA jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDc jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLD jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLDM no mismatches downstream |
search.py
from utils import load_data # tlen is minimum overlap # functions deal with one string at a time # do not assume upstream matches # e is search string, f is potential match def one_search(i,L,tlen): # search result has associated tlen oL = list() e = L[i] t = e[-tlen:] for j,f in enumerate(L): if i == j or not t in f: continue # offset is pos in e where f should start # not guaranteed to match except over t offset = len(e) - f.index(t) - tlen oL.append((j,offset)) oL.sort(key=lambda item:item[1]) return oL # mmL is a list of mismatched strings # ummL contains upstream mismatches def test_upstream(i,L,oL,ummL): e = L[i] for j,o in oL: f = L[j] if not f.startswith(e[o:]): if i < j: ummL.append((i,j)) else: ummL.append((j,i)) def trim_downstream(oL,L,tlen): pL = list() for j,o in oL: f = L[j] k = len(f) - o pL.append((j,f[k:])) return pL def test_downstream(pL,dmmL): mismatches = 0 for i,s1 in pL: for j, s2 in pL: if s1 == s2 or len(s1) < len(s2): continue if not s1.startswith(s2): if i < j: dmmL.append((i,j)) else: dmmL.append((j,i)) if __name__ == '__main__': L = load_data('data.mod.txt') tlen = 4 i = 0 e = L[i] oL = one_search(i,L,tlen) print 'i=', i, ':', len(oL), 'matches found' print e print ' ' * (len(e)-tlen) + e[-tlen:] print for j,o in oL: k = len(e) - o print ' '*o + L[j][:k] ummL = list() test_upstream(i,L,oL,ummL) if ummL: print len(ummL), 'mismatches' else: print 'no mismatches upstream\n' dmmL = list() print e[-tlen:] pL = trim_downstream(oL,L,tlen) for item in pL: print ' '*tlen + item[1] test_downstream(pL,dmmL) if dmmL: print len(dmmL), 'mismatches' else: print 'no mismatches downstream' |