Python for Bioinformatics: April 2012

Monday, April 30, 2012

KS test (2)

A bit more about the Kolmogorov-Smirnov test after the first post here. I found a pdf at MIT ocw about this. I don't understand much of it but it's clear that using the absolute value of the difference to get the statistic is correct.

Now we need to figure out the distribution of the statistic under the null hypothesis, so we can set our type I error at 0.05. From my perspective, the way to do this is to do a simulation in Python. The code below does the following N times: grab n=20 numbers from the standard normal and sort them, compute the D statistic as we did in the previous post, and accumulate those values in a list.

We sort the list, then find the place in the list where the statistic for the test case is found. (I used a function from the bisect module for this, but the naive method gives the same result).

I get a p-value of 0.25. This does not match what we had before (0.36), and I am not sure why that is.

To do a little more checking, I found an online reference that gives p-values for various statistic,n combinations.

for n = 20

p-value   D
0.20   0.210
0.15   0.246
0.10   0.264
0.05   0.294
0.01   0.356

According to this, a statistic of 0.20 (what we had) gives a p-value a bit larger than 0.20, but not much. So this source doesn't agree with what we got from R and Python. Maybe the online source is for a one-sided test (it doesn't state clearly), that would give a smaller p-value for a given statistic.

I added calculations for our simulation of d given various p, and vice-versa. The output:

> python script.py
d = 0.199 p = 0.25
d = 0.246 p = 0.10
d = 0.264 p = 0.06
d = 0.294 p = 0.03

p = 0.15 d = 0.226
p = 0.10 d = 0.245
p = 0.05 d = 0.275
p = 0.01 d = 0.336

Not sure what the correct answer is here, but I can't find a problem with the code and I feel like it's time to move on. One last bit is something to think about that is quoted here:

The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple, and doesn't do a good job of discriminating whether or not your data was sampled from a Gaussian distribution. An expert on normality tests, R.B. D’Agostino, makes a very strong statement: “The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used.” (“Tests for Normal Distribution” in Goodness-of-fit Techniques, Marcel Decker, 1986).

Take that, Andrey!

from bisect import bisect_left as bl
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
    
# normal cdf
rv = norm()

def calc_stat(xL,yL):
    xL.sort()
    dL = list()
    for x,y0 in zip(xL,yL):
        y1 = rv.cdf(x)
        dL.append(abs(y1-y0))
    return max(dL)

def p_value(d,L):
    #for i,e in enumerate(L):
        #if d < e:
            #break
    i = bl(L,d)
    p = 1 - ((i+1)*1.0/len(L))
    return p
#-----------------------------------
# simulation
N = 10000
n = 20
yL = range(1,n+1)
yL = [1.0*y/n for y in yL]  #replaces "steps"
pL = list()

for i in range(N):
    xL = np.random.normal(0,1,n)
    D = calc_stat(xL,yL)
    pL.append(D)

plt.hist(pL,bins=50)
plt.savefig('example.png')
#-----------------------------------
pL.sort()
for d in [0.199, 0.246, 0.264, 0.294]:
    p = p_value(d,pL)
    print 'd = %3.3f' % d,
    print 'p = %3.2f' % p

print
for p in [0.15, 0.10, 0.05, 0.01]:
    i = int(N*(1-p))
    d = pL[i]
    print 'p = %3.2f' % p,
    print 'd = %3.3f' % d

KS test

I ran into a question about using the Kolmogorov-Smirnov test for goodness-of-fit in R and Python. I'd heard of the KS test but didn't know what it was about so I decided to explore.

Wikipedia's explanation was opaque at first (not surprising), but I found nice ones here and here.

As the references say, the KS test can be used to decide if a sample comes from a population with a specific distribution. We calculate a statistic D, and if the statistic is large enough we reject the null hypothesis that the sample comes from the specified distribution.

What is the D statistic? I think it is just the maximum deviation between the observed cdf for the sample, compared to the distribution. That works for the script below.

The KS test is a non-parametric test, it doesn't depend on assumptions about the distribution (e.g. standard deviation) of the data.

For these tests, I needed to reinstall scipy. The previous difficulties were not observed this time. It was as simple as:

git clone git://github.com/scipy/scipy
cd scipy
python setup.py build
sudo python setup.py install

The script below grabs 20 numbers from the normal distribution (sd = 2.0) and compares them with the standard normal distribution (sd = 1). The script does a plot (screenshot above) and prints the results of the KS test:

> python script.py 
max 0.198
D = 0.20, p-value = 0.36

We are not able to reject the null hypothesis, even though the source of the numbers had sd=2 and the null hypothesis is that sd=1.

I did a quick check by copying the numbers

x=c(-5.05,-2.17,-1.66,-1.35,-0.98,-0.97,-0.84,-0.74,-0.67,-0.00,0.21,0.24,0.33,0.67,0.94,1.63,1.65,2.05,2.22,3.19)

and pasting them into R. Then:

> ks.test(x,pnorm,0,1)

 One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.1986, p-value = 0.3611
alternative hypothesis: two-sided

Which matches. I also wrote a slightly hackish test of the test in R:

f <- function () {
 c = 0 
 for (i in 1:100) {
 y1 = rnorm(n,mean=0,sd=1)
 result = ks.test(y1,"pnorm",0,1)
 if (result$p.value < 0.05) {
  print (result$statistic)
  print (result$p.value)
  }
 }
}

This function spews out the values of the parameters if the p-value is < 0.05. We're testing against the same distribution from which the samples were drawn (i.e. the null hypothesis is correct), so we expect that about 5 of the 100 tries will have a test statistic so extreme that we're fooled into thinking the null hypothesis is wrong. Output:

> n = 10
> f()
       D 
0.454372 
[1] 0.020945
        D 
0.4113619 
[1] 0.04811348
        D 
0.4722679 
[1] 0.01440302
      D 
0.43666 
[1] 0.02984022
        D 
0.5340613 
[1] 0.003428069
        D 
0.4336947 
[1] 0.03161219

Looks good to me.

script.py

import matplotlib.pyplot as plt
from scipy.stats import uniform, norm, kstest
import numpy as np

np.random.seed(1773)

def steps(L):
    delta = 1.0/len(L)
    y = 0
    rL = list()
    for x in L:
        y += delta
        rL.append(y)
    return rL

xL = np.random.normal(0,2.0,20)
xL.sort()
yL = steps(xL)

#----------------------------------

# points
plt.scatter(xL, yL, s=50, zorder=1)

# distribution
rv = norm()
pL = np.linspace(-3, 3)
plt.plot(pL, rv.cdf(pL), '--')

# vertical lines
dL = list()
for x,y0 in zip(xL,yL):
    y1 = rv.cdf(x)
    dL.append(abs(y1-y0))
    plt.plot((x,x),(y0,y1),
        color='r', zorder=0)

ax = plt.axes()
ax.set_xlim(-4, 4)
ax.set_ylim(-0.1, 1.1)
plt.savefig('example.png')

print 'max %3.3f' % max(dL)
t = kstest(xL,'norm')
print 'D = %3.2f, p-value = %3.2f' % t

xL = ['%3.2f' % n for n in xL]
print ','.join(xL)

Saturday, April 28, 2012

Build R

I downloaded the latest version of R this morning and played a bit.

Just for fun, I decided to see if I could build it from source. I ran into only one hiccup, related to gfortran. The gfortran compiler is a new one I got the other day to build Julia.

> gfortran -v
..
Target: x86_64-apple-darwin11
..
gcc version 4.6.2 20111019 (prerelease) (GCC)

Here are the build instructions for R:

# download R
curl -O http://cran.r-project.org/src/base/R-2/R-2.15.0.tar.gz
# unpack R
tar fxz R-2.15.0.tar.gz
# choose the architecture (x86_64, i386, ppc or ppc64)
arch=x86_64
# create a build directory
mkdir R-$arch
cd R-$arch
# configure R
../R-2.15.0/configure r_arch=$arch  CC="gcc -arch $arch" \
  CXX="g++ -arch $arch" F77="gfortran -arch $arch" \
  FC="gfortran -arch $arch" OBJC="gcc -arch $arch" \
  --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib \
  --with-blas='-framework vecLib' --with-lapack
# build R
make -j8
# At this point you can run R as `bin/R' even without installing

# install R
make install
# in some cases you may need to use ``sudo make install'' instead

The configure step fails with a complaint about the options to gfortran. The config.log file reads:

..
configure:7007: result: defining F77 to be gfortran -arch x86_64
configure:7179: checking for Fortran 77 compiler version
configure:7188: gfortran -arch x86_64 --version >&5
GNU Fortran (GCC) 4.6.2 20111019 (prerelease)
Copyright (C) 2011 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

configure:7199: $? = 0
configure:7188: gfortran -arch x86_64 -v >&5
Driving: gfortran -mmacosx-version-min=10.7.3 -arch x86_64 -v -l gfortran -shared-libgcc
gfortran: error: x86_64: No such file or directory
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/local/gfortran/libexec/gcc/x86_64-apple-darwin11/4.6.2/lto-wrapper
gfortran: error: unrecognized option '-arch'

Removing "-arch $arch" from F77 and FC above, fixed it. R builds pretty quickly, then this works:

It even gives a Quartz window from which I can save the figure.

I don't have any real reason to do this, just thought I'd try it. Also, an important question is whether shared libraries like /usr/X11/lib/libpng.dylib are linked in properly. This works (but I'm not sure where it's getting libpng from):

> png('~/Desktop/x.png')
> plot(1:10)
> dev.off()
null device 
          1 
>

Friday, April 27, 2012

Python mergesort

I need to make a serious commitment to algorithms at some point. I have the 3rd edition of Sedgewick, and worked through volume 1 some years ago. I see the 4th edition has moved to Java; I think that's a big mistake. I would hope for a basic algorithms book in Python, but I'll take C++ over Java any day. (I also have Cormen but put it aside).

I should really buy Knuth's books, but worry they'd be beyond my abilities.

I also have Magnus Hetland's book and Zelle's also, and I think they're great, but haven't found the time to be systematic about either of them either. They are for this summer.

Just for fun, here is a modified version of mergesort (from my book).

There are a couple of ideas here. First, suppose we have two lists that are already sorted. It's easy to merge them into one sorted list, by comparing the next elements in each and choosing the correct one. That's what merge does. The second idea is recursion. That's what mergesort does. The code below is set up to read a power of ten from the input.

This is a good challenge for a beginning Python coder.

Here is some timing on mine:

> time python merge.py 3

real 0m0.054s
user 0m0.039s
sys 0m0.013s
> time python merge.py 4

real 0m0.232s
user 0m0.216s
sys 0m0.014s
> time python merge.py 5

real 0m8.201s
user 0m8.178s
sys 0m0.020s
>

It hangs with N = 1e6, I am not sure why yet.

merge.py

import random, sys

def merge(left,right):
    rL = list()
    while True:
        if left and right:
            if left[0] < right[0]:
                rL.append(left.pop(0))
            else:
                rL.append(right.pop(0))
            continue
        if left:
            rL.extend(left)
        elif right:
            rL.extend(right)
        return rL

def merge_sort(iL):
    if len(iL) == 1:
        return iL
    n = len(iL)/2
    left = merge_sort(iL[:n])
    right = merge_sort(iL[n:])
    result = merge(left,right)
    return result

if __name__ == '__main__':
    try:
        n = int(sys.argv[1])
    except:
        n = 4
    N = 10**n
    L = range(N)
    random.shuffle(L)
    result = merge_sort(L)
    assert result == sorted(L)

UTF-16 endian-ness

This is a brief continuation of the post on hexdump. I saved the same data in different files as UTF-16 in both big-endian and little-endian formats using TextMate.

> hexdump -C text16B.txt 
00000000  fe ff 00 54 00 68 00 69  00 73 00 20 00 69 00 73  |...T.h.i.s. .i.s|
00000010  00 20 00 73 00 6f 00 6d  00 65 00 20 00 74 00 65  |. .s.o.m.e. .t.e|
00000020  00 78 00 74 00 0a 00 2d  00 2d 00 e4 00 2d 00 2d  |.x.t...-.-...-.-|
00000030  00 3a 00 20 00 20 00 61  00 20 00 77 00 69 00 74  |.:. . .a. .w.i.t|
00000040  00 68 00 20 00 75 00 6d  00 6c 00 61 00 75 00 74  |.h. .u.m.l.a.u.t|
00000050  00 0a 00 2d 00 2d 22 1a  00 2d 00 2d 00 3a 00 20  |...-.-"..-.-.:. |
00000060  00 20 00 73 00 71 00 72  00 74                    |. .s.q.r.t|
0000006a


> hexdump -C text16L.txt 
00000000  ff fe 54 00 68 00 69 00  73 00 20 00 69 00 73 00  |..T.h.i.s. .i.s.|
00000010  20 00 73 00 6f 00 6d 00  65 00 20 00 74 00 65 00  | .s.o.m.e. .t.e.|
00000020  78 00 74 00 0a 00 2d 00  2d 00 e4 00 2d 00 2d 00  |x.t...-.-...-.-.|
00000030  3a 00 20 00 20 00 61 00  20 00 77 00 69 00 74 00  |:. . .a. .w.i.t.|
00000040  68 00 20 00 75 00 6d 00  6c 00 61 00 75 00 74 00  |h. .u.m.l.a.u.t.|
00000050  0a 00 2d 00 2d 00 1a 22  2d 00 2d 00 3a 00 20 00  |..-.-.."-.-.:. .|
00000060  20 00 73 00 71 00 72 00  74 00                    | .s.q.r.t.|
0000006a

The first two bytes are a byte order mark or BOM. You can see how that works.

Notice the change in our example code points from last time. The ä ('LATIN SMALL LETTER A WITH DIAERESIS'), which is c3 a4 in UTF-8, is 00 e4 in big-endian UTF-16 (and the reverse in little-endian)

The √ is e2 88 9a in UTF-8, and is 22 1a in big-endian UTF-16 and reversed as well in little-endian.

The most important point is that the bytes are completely different depending on the encoding (UTF-8 versus UTF-16), and a difference of encoding is usually the source of weird or unexpected stuff you see printed on the screen, when the encoding and decoding don't match.

I wish they had come up with some other words to describe the byte order than big- and little-endian. The reference to Gulliver is amusing, but a string of bytes has two ends. Which is supposed to be the big one? It's similar to the situation with genes, where people talk about the 5' and 3' ends, but of course each end of double-stranded DNA has both 5' and 3' ends!

Big-endian is described as the "most significant byte first." How about "highest value byte first" or "natural order" ?

hexdump

This is a quick look at hexdump, with a brief discussion of UTF encodings.

This is some text --ä--: a with umlaut --√--: sqrt

Paste that into a text editor (like TextMate) and save in UTF-8 encoding (default).

> hexdump -C text8.txt 
00000000  54 68 69 73 20 69 73 20  73 6f 6d 65 20 74 65 78  |This is some tex|
00000010  74 0a 2d 2d c3 a4 2d 2d  3a 20 20 61 20 77 69 74  |t.--..--:  a wit|
00000020  68 20 75 6d 6c 61 75 74  0a 2d 2d e2 88 9a 2d 2d  |h umlaut.--...--|
00000030  3a 20 20 73 71 72 74                              |:  sqrt|
00000037

[ Two other very useful flags are -n (number of bytes to read) and -s (offset, number of bytes to skip. Flags must preceed the filename. ]

The middle two columns are the bytes, and the last is the ASCII version for each, which we requested using the -C flag. Values at or above 128 (hex '0x80'), and I think the unprintable low-value ones, are shown as a "."

The output gives us a chance to practice hex arithmetic. For example, the third byte is '0x69' which is "i" and the fourth byte is '0x73' which is "s". That's because the hex characters between these two are:

'0x6a' '0x6b' '0x6c' '0x6d' '0x6e' '0x6f' '0x70' '0x71' '0x72'
"j"    "k"    "l"    "m"    "n"    "o"    "p"    "q"    "r"

The a with umlaut is also called 'LATIN SMALL LETTER A WITH DIAERESIS' (reference). There is a nice Python doc about Unicode, also wikipedia, where you can learn about the saga of Klingon.

The UTF-8 encoding of 'LATIN SMALL LETTER A WITH DIAERESIS' is

'0xc3' '0xa4'

and you can see these two bytes in the second line of the printout flanked by the repeated bytes '0x2d' for "--"

The way this works is that UTF-8 is a variable length encoding:

0zzzzzzz                    one-byte character
110yyyyy 10zzzzzz           two-byte character
1110xxxx 10yyyyyy 10zzzzzz  three-byte character

An ASCII text and its UTF-8 encoding would be identical. Let's grab the bits:

>>> bin(0xc3)
'0b11000011'
>>> bin(0xa4)
'0b10100100'

In bytes then, '0xc3' '0xa4' is:

11000011 10100100

Remove the "meta" bits (not sure what they're actual name is at the moment), leaving:

00011 100100

and padding out, we obtain:

00000000 11100100

which is:

00 e4

The two bytes above are the encoding in UTF-16, which we could just have obtained from the reference page.

This format is what's called big-endian.

The second non-ASCII character we have is the square-root symbol √ (encoded in this HTML page as ampersand-radic;)

In the output we have:

e2 88 9a

>>> bin(0xe2)
'0b11100010'
>>> bin(0x88)
'0b10001000'
>>> bin(0x9a)
'0b10011010'

In binary, that is:

11100010 10001000 10011010

Stripping out the meta info:

0010 001000 011010

No padding needed:

00100010 00011010

In hex:

22 1a

which matches the reference.

I'll have to put off discussion of UTF encodings in Python to another time.

Thursday, April 26, 2012

Basic find

This is to remind me of how to use the Unix command find. Possibly only I will find this useful. I'm just "indexing my head."

Make a test directory with one file:

> cd ~/Desktop
> mkdir x
> touch x/y.txt
> cd

Basic usage for find is: find dirname -name filename

> find . -name y.txt
./Desktop/x/y.txt
>

Don't forget the -name. Without it, find thinks file_name is just another place to search, since find can take multiple directories to search. "." is, of course, the current directory (I'm in my home directory "~"), while "/" would be the root of the file system.

Another use would be to count the number of files with wc:

> find . | wc -l
  158479
>

Of course, find can use wildcards.

> find ~/Desktop -name \*.txt
/Users/telliott/Desktop/notes.txt
/Users/telliott/Desktop/todo.txt
/Users/telliott/Desktop/x/y.txt

Notice the backslash to escape: \*

[ UPDATE: Can also be done with quotes "*.txt" ]

The above will also find hidden files. Other useful options for find include:

-type -mtime -mmin

For example, find all recently changed files:

> find ~/Desktop -mmin -30
/Users/telliott/Desktop
/Users/telliott/Desktop/.DS_Store
/Users/telliott/Desktop/.y.txt
/Users/telliott/Desktop/x
/Users/telliott/Desktop/x/y.txt

Do this with ~ to find where TextEdit saves all its data about the current state!

With appropriate caution, you might pipe the results to xargs to run a command:

> cd Desktop/
> find . -name \*.txt
./.y.txt
./notes.txt
./Rob Pike.txt
./todo.txt
./x/y.txt
> find . -name \*.txt | xargs /bin/ls -al
ls: ./Rob: No such file or directory
ls: Pike.txt: No such file or directory
-rw-r--r--  1 telliott  staff    0 Apr 26 06:23 ./.y.txt
-rw-r--r--@ 1 telliott  staff   29 Apr 23 15:54 ./notes.txt
-rw-r--r--@ 1 telliott  staff  815 Apr 24 17:19 ./todo.txt
-rw-r--r--  1 telliott  staff    0 Apr 26 06:00 ./x/y.txt

Notice what happened to the filename that contains a space. I have to read more to know how to fix that.

If you're searching the system without read permission for directories you'll get error messages:

> find / -name foo
find: /.DocumentRevisions-V100: Permission denied
find: /.fseventsd: Permission denied
find: /.Spotlight-V100: Permission denied
find: /.Trashes: Permission denied

They can be discarded like this:

> find / -name foo 2>/dev/null

This redirects stderr (the "2") to /dev/null, which is a Unix black hole.

Another sort of listing is obtained with du-disk usage, piped to sort and then to tail.

> du -sc * | sort -n | tail
1590728 My Downloads
3188512 Software
3392024 Dropbox
8421800 VirtualBox VMs
8509928 Library
8937696 Teaching
19995376 Pictures-TE
43919360 Music
236592936 Movies
336685360 total

Learn Unix in 10 minutes (quick reference)
A nice reference for find (and other stuff).

Wednesday, April 25, 2012

Julia

Julia is a new "high-level, high-performance dynamic programming language for technical computing." Website here and rationale here. Interview with Stefan Karpinski.

I followed the instructions here to download and build it. The only issue is that I failed to read the platform-specific notes before starting, so I had the wrong (old) gfortran. There are links in the docs, but I followed the notes in this thread and got it from (download link).

[ UPDATE: re the first comment, some discussion here. ]

Tuesday, April 24, 2012

big brother

As I mentioned in a previous post, I "upgraded" to iCloud from MobileMe basically in order to keep my .mac email address (which is actually now a .me address, but it forwards, or something). I'm not happy about iDisk going away, since I linked to so much content here on the blog, but that's the way it's going to be. I'm not going to fix all those links by hand to point to a new server. I still have the files, if anyone wants something.

I use Dropbox, and I love it. It's simple to choose which directories and files are shared, and I share everyday between work and home. I've also been grateful for the ability to roll back changes and recover deleted files, and also to share files in my Public folder.

Now comes iCloud.

How do you share selected folders to the cloud? As far as I know, you can't. There is a method explained by a Macworld editor in a post titled "Make Your Own Personal Dropbox With iCloud", and at the end he says:

Apple says that using this folder in the manner described above may lead to data loss

That doesn't sound like something I want to try. And then of course there's this:

I checked the box for "Documents & Data", and then unchecked it, to find this ominous alert. Note there's no way to say, please don't erase all that shared data from my Mac. Big brother knows best. I really have no explanation for it, but Apple have become even bigger a***oles than Microsoft were at their peak.

So it's a very welcome thing to see Google Drive announced today (article).

They would never erase my data for some stupid DRM-related reason. Would they?

defaultdict

A common use case for dictionaries is to hold a count of the occurrences of a key in some other object, for example, words in a file. In this case, one needs to deal with missing keys:

>>> L = list('abcaba')
>>> D = dict()
>>> for k in L:
...     try:
...         D[k] += 1
...     except KeyError:
...         D[k] = 1
... 
>>> D
{'a': 3, 'c': 1, 'b': 2}

Another solution is to use the setdefault method of dicts:

>>> D = dict()
>>> for k in L:
...     D.setdefault(k,0)
...     D[k] += 1
... 
0
0
0
1
1
2
>>> D
{'a': 3, 'c': 1, 'b': 2}

A better way (certainly better than this second approach) is to use a defaultdict:

>>> from collections import defaultdict
>>> D = defaultdict(int)
>>> for k in L:
...     D[k] += 1
... 
>>> D
defaultdict(, {'a': 3, 'c': 1, 'b': 2})

Some of the magic involves __missing__:

>>> 'd' in D
False
>>> D.__missing__('d')
0
>>> 'd' in D
True
>>> D
defaultdict(, {'a': 3, 'c': 1, 'b': 2, 'd': 0})

For more you can go to the docs or here or here, or just do this:

>>> print defaultdict.__missing__.__doc__
__missing__(key) # Called by __getitem__ for missing key; pseudo-code:
  if self.default_factory is None: raise KeyError((key,))
  self[key] = value = self.default_factory()
  return value

[ UPDATE: The collections.Counter class was mentioned in a comment. Don't know why I didn't mention it in this post, I've used it in many other posts here. If it's convenient to convert the data to a list of keys, that's the way to go. ]

A different use of cat

> cat > hello.py << EOF
> print "Hello, world!"
> EOF
> python hello.py
Hello, world!
> more hello.py
print "Hello, world!"
>

How does this work? According to the man page for cat, if

file is a single dash (`-') or absent, cat reads from the standard input.

The ">" redirects to a file as usual.

The "<<" is special. The whole construct is called a "here document." I found more about it including its proper name here, which leads to a wikipedia entry.

Note that any symbol not present by itself on a line in the desired text will work for termination. Here's a rather silly sequence in which it took me a while to enter the single character 'X' to stop input:

> cat > x.txt << X
> x
> cat > x.txt << XY
> cat > x.txt << XYZ
> print me
> XYZ
> cat > x.txt << X
> hi
> X
> more x.txt
x
cat > x.txt << XY
cat > x.txt << XYZ
print me
XYZ
cat > x.txt << X
hi
>

Quick Unix: a brief look at Unix stuff that I find interesting and not too obvious.

Monday, April 23, 2012

mtree on OS X

I came across a utility to test for changes in files (and more stuff, like permissions) in Chapter 4 of EdgeEnterprise Mac Security: mtree.

I'd never heard of it. There's a writeup about it here, as well as a couple of blog entries (here and here)

Rather than go straight to mtree (I still have to read the man page), this is a write-up of the bash script posted on the the first blog post. So it's really more about improving my bash scripting-fu (origin here) than it is about mtree.

> nl script.sh
1 #!/bin/sh

2 SCRIPT="`basename $0 | sed 's/\..*$//'`"

3 CONFIG=${1:-$HOME/.$SCRIPT}
4 DIR=${2:-$HOME/Documents}

5 cd $DIR || \
6        { echo "Failed to change working directory to $DIR" >&2; exit 1; }

7 if [ -f $CONFIG ]; then
8        /usr/sbin/mtree -f $CONFIG
9 fi

10 /usr/sbin/mtree -c -k time,md5digest,ripemd160digest > $CONFIG>

I added line numbers with nl, you could use cat or other methods.

I hope line 1 is obvious (see wikipedia if it's not).

In line two, we take the variable $0, which is the path to the script, obtain its basename, and then strip off the extension:

> basename /Users/telliott/script.sh
script.sh
> echo x.txt | sed 's/\..*$//'
x

I had an even harder time with lines 3 and 4, but the folks on StackOverflow helped me.

I learned that bash has a man page (of course!). Also that one can search within a man page by entering "/ :-" after you get the page, which initiates a search for ":-". The answer:

${parameter:-word} Use Default Values. If parameter is unset or null, the expansion of word is substituted. Otherwise, the value of parameter is sub- stituted.

So these two lines

3 CONFIG=${1:-$HOME/.$SCRIPT}
4 DIR=${2:-$HOME/Documents}

simply give default values for $CONFIG and $DIR if we don't supply them as arguments ($1 and $2) when invoking the script. If we called the script by ./script.sh from the Desktop, then we'd have $CONFIG be equal to ~/.script (a hidden file) and $DIR be equal to ~/Documents.

Lines 5 and 6 change the working directory to ~/Documents as default, if this fails (because the path input as an argument is no good) we write a message to >&2 and exit with 1 signifying failure. >&2 is stderr. This writes to the Terminal for me.

Lines 7 through 9 check whether $CONFIG exists already, if it does then we call mtree with the -f flag and $CONFIG argument:

/usr/sbin/mtree -f $CONFIG

From the mtree man page:

-f file
Read the specification from file, instead of from the standard
input.

If this option is specified twice, the two specifications are com-
pared with each other, rather than to the file hierarchy. The
specifications be sorted like output generated using -c. The out-
put format in this case is somewhat remniscent of comm(1), having
"in first spec only", "in second spec only", and "different" col-
umns, prefixed by zero, one and two TAB characters respectively.
Each entry in the "different" column occupies two lines, one from
each specification.

But the first time through, $CONFIG shouldn't exist yet. Instead, we call mtree with the -c and -k flags followed by various arguments, and write the result to $CONFIG. Let's do it in a simple way, as shown below. From the Desktop directory I input the command as shown, and write the results to x.txt in $HOME. As you can see, timestamps for all the Desktop files are there:

/usr/sbin/mtree -c -k time > x.txt
> cat ~/x.txt
#    user: telliott_admin
# machine: Toms-Mac-mini.local
#    tree: /Users/telliott_admin/Desktop
#    date: Mon Apr 23 17:20:21 2012

# .
/set type=file
.               type=dir time=1335216020.0
.DS_Store   time=1335216014.0
.localized  time=1333333575.0
notes.txt   time=1335210868.0
post.txt    time=1335216020.0
script.sh   time=1335215494.0
script.sh.txt \
time=1335211653.0
todo.txt    time=1335204803.0
..

>

Now, if I do it with -f, I get:

> /usr/sbin/mtree -f ~/x.txt
. changed
modification time expected Mon Apr 23 17:20:20 2012 found Mon Apr 23 17:20:51 2012
.DS_Store changed
modification time expected Mon Apr 23 17:20:14 2012 found Mon Apr 23 17:20:55 2012
post.txt changed
modification time expected Mon Apr 23 17:20:20 2012 found Mon Apr 23 17:20:51 2012

I see a change because I'm writing this text into post.txt! For some reason .DS_Store also changed. Next time I'll try to do something a little more useful.

Channeling Lincoln

I found this on reddit. The post says to consider the first part of the Gettysburg address:

Four score and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal

How long do you think it would take you to manually break the address up into 13 character lines, breaking those lines at appropriate spaces?

I'm not so concerned with the estimation process. This is a fun problem for a beginning Python coder. My solution is below, but please try it yourself first. And yes, I was pleased with my time, but I've done this before. How to handle the beginning and the end is the key. I just make them special cases.

[ UPDATE: You could also just look for the right function in the standard library! ]

> python script.py 
10 Four score
 9 and seven
13 years ago our
 7 fathers
13 brought forth
 9 upon this
11 continent a
11 new nation,
12 conceived in
11 liberty and
12 dedicated to
 3 the
11 proposition
12 that all men
11 are created
 5 equal

FH = open('data.txt','r')
data = FH.read().strip().split()
FH.close()

pL = list()
N = 13
for word in data:
    assert len(word) <= N
tL = [data.pop(0)]

for word in data:
    n = len(' '.join(tL)) 
    n += 1 + len(word)
    if n > N:
        pL.append(' '.join(tL))
        tL = [word]
    else:
        tL.append(word)

pL.append(word)

for line in pL:
    print "%2d" % len(line), line

Threats on OS X

It is said that if you know your enemies and know yourself, you will not be imperiled in a hundred battles; if you do not know your enemies but do know yourself, you will win one and lose one; if you do not know your enemies nor yourself, you will be imperiled in every single battle.

--Sun Tzu

Like every other Mac user who is paying attention this week, I've been reading the stories about malware on OS X (e.g. Ars Technica about Flashback here and here).

Charles Edge's Enterprise Mac Security is a good place to start learning about this stuff (although the book's pub date is 2010 and in places it looks quite a bit older). It's also probably not for the average user.

Some terminology: malware is a comprehensive term for various strategies by which bad guys attack users' machines, including (quoting Edge):

Virus: code attached to a file, which can replicate itself and spread to other files on a computer. A simple virus does not spread over a network without transfer of infected files. [Wikipedia has a somewhat different take.]

Worms: spread across networks by taking advantage of security flaws. [An example is OSX.Leap.A.]

Trojan Horse: these embed themselves in an application and activate only when the app is opened. [The app itself is bad, it disguises its actions from the user.]

Rootkit: a software program written to control an operating system.

I've been running OS X since the beginning (when we left System 9 behind), and I've never had anti-virus software. But recent events made me start thinking about being more proactive.

I download and install a lot of non-Apple, non-App Store software, much of it for bioinformatics. I'm well aware that if I type in my admin password for some evil app, there isn't much protection. Yet I run in admin mode all the time, because I find it a pain to be on a plain account. That's the first thing to change if you're not in the same situation.

The most important aspect of my security policy is that I don't have anything on the computer I can't afford to have stolen. I don't have any banking passwords saved. I could be undone by a Keylogger, but that's about it.

My Sharing Prefs panel consists entirely of unchecked boxes. If you want to share stuff you should definitely read Edge's book or another.

I always do a clean install when I upgrade the OS. I have most data backed up to two different hard drives. I pay attention and always update ASAP.

For my desktop, I have automatic login enabled. If someone has physical access to your machine, you're pretty much hosed anyway. (Unless you want to do FileVault).

I've never been crazy about anti-virus software:

the market share argument is MS propaganda and FUD---OS X with no sharing is hard to crack

most (all?) Pwn2Own stuff is user account-only, not root (not positive about this, how hard is privilege escalation?)

anti-virus never protects against zero-day exploits

the AV guys talk a lot to drum up business, like ambulance-chasing, ~~bottom-feeding blood-sucking~~ lawyers

Nevertheless, the news that the Java exploit somehow got around the requirement for an admin password really worries me. It's not easy to find out how it works (I suppose they don't want copycats). So I'm happy that I don't have Java on my machines running Lion, and I disabled it on the old one.

I've taken two new steps toward heightened security. I downloaded and am evaluating Little Snitch.

So far, I haven't seen any evidence of network connections that wouldn't be expected. However, I was a bit surprised to see that many apps (including all Apple apps) phone home each time they are launched (origin of the phrase).

Also, there are a lot of things (Agents and daemons) doing network stuff that I wasn't aware of---as one example, PubSubAgent.

This agent checks for updates to your RSS feeds, of which I have none (too old fashioned). At some point recently, a Safari update changed the Pref for this to "automatically update articles in Bookmarks bar". It bothered me because I've been methodical about deleting cookies and website data when I'm done. Yet PubSubAgent wanted to connect to feeds.arstechnica.com. I thought, how would they know about that? before understanding that it's a Safari RSS thing. Anyway, I turned off the reader as a test, and we'll see if PubSubAgent doesn't disappear.

I heard about Lingon in Edge's book. It's an app for managing LaunchAgents and (I thought from the book) LaunchDaemons.

Lingon is from Peter Borg, who made (still makes) Smultron. I used Smultron a lot in the days before TextMate, and loved it. So I downloaded the app from the AppStore, but I'm disappointed because it's been crippled. It only works with current user-specific processes and won't show me anything about system stuff.

One thing it did show me is an automatic Adobe Reader update job. I only installed Reader (300 MB!!) for one time when I had to use it. That job is now deleted.

The second new tool that I'm working with is ClamAV, an open source anti-virus tool. I read about it here and got it (with a Mac GUI) from here.

Apparently ClamAV has sucked for OS X in the past, but lately things have changed for the better. Anyway, the scan didn't find anything on my account's files.

What else should I worry about?

My Wi-Fi network is not "closed", the name is visible when you scan for networks. This could be a security liability. I don't know if it's possible to do a brute-force password attack on the Airport---something to read up on. But at least I have a decent password. Luckily, I live in a place where very few people drive by.

My ip address from the isp is supposed to be dynamic, but it's not---hasn't changed forever, as far as I know. Apparently one can force a change by spoofing a new MAC address for the router, but I don't want to do that. Maybe I'll go rummage in the basement for an old router and see what happens.

Rootkits certainly do exist for OS X. There have been ways to look for them (by scanning hashes of system files for alterations) like this Linux tool.

I believe that Code Signing should make standard rootkits obsolete. But maybe not, since this is not for "scripts" and such but full-fledged applications. Perhaps a question to apple.stackexchange is in order.

Finally, I think the advantage has to lie with the offense, in a situation where they cannot themselves be attacked. Structure your defense by asking yourself first: what's the worst that could happen if I did get hacked? And take steps to ameliorate that.

One last thing. The future of OS X is clearly not good for people like me: my computer will be like my iPhone. Apps can do whatever they want on the network, there's so much network activity (and push stuff), that the user can't make sense of it, but in theory an app will be so completely sandboxed it can't touch the rest of the machine. My programs won't be able to just do whatever they want anymore. Like the impact of barbed-wire fencing on the old West.

Sorry for the rambling post..

Sunday, April 22, 2012

PyObjC Templates for Xcode are back!

I came across a page from a guy who went to the trouble of making and posting new templates for using PyObjC with Xcode 4.

Thanks!

They are on github here. I just followed the instructions:

Copy the File Templates and Project Templates folders to the following path in your home directory, creating any missing intermediate directories if needed:

~/Library/Developer/Xcode/Templates/

I'm a little rusty with Xcode (and it's become more and more like flying the space shuttle or something. Still, I was able to recycle this old project, while changing it just a bit.

One funny thing, I couldn't figure out how to hook up the text field outlet in the old way. I had to use bindings, and of course I did so for the popup as well (both Content and Selected Index).

I don't have a lot of time to fool with Xcode any more, but I do want to check out Greg's blog. Thanks again.

Here's the AppDelegate:

#-*- coding: utf-8 -*-

from Foundation import *
from AppKit import *

class SpeakerAppDelegate(NSObject):
    TF = objc.ivar('TF')
    voiceL = objc.ivar('voiceL')
    nameL = objc.ivar('nameL')
    selectedIndex = objc.ivar('selectedIndex')
    notSpeaking =  objc.ivar('notSpeaking')
    pretext = 'Hi, my name is '
    
    def init(self):
        s = NSSpeechSynthesizer.alloc().initWithVoice_(None)
        self.speaker = s
        self.speaker.setDelegate_(self)
        self.setVoices()
        self.notSpeaking = True
        self.setSelectedIndex_(0)
        return self
    
    def setTF_(self, value):
        if value != self.TF:
            self.TF = value

    def setSelectedIndex_(self, value):
        if value != self.selectedIndex:
            self.selectedIndex = value
        name = self.nameL[self.selectedIndex]
        self.setTF_(self.pretext + name)

    def setVoices(self):
        L = NSSpeechSynthesizer.availableVoices()
        self.voiceL = L
        DL = [NSSpeechSynthesizer.attributesForVoice_(v) for v in L]
        nL = [D.objectForKey_('VoiceName') for D in DL]
        self.nameL = NSMutableArray.arrayWithArray_(nL)
        NSLog("%s" % self.nameL)
    
    @objc.IBAction
    def speak_(self,sender):
        i = self.selectedIndex
        NSLog("%i Say:  %s" % (i, self.TF))
        self.speaker.setVoice_(self.voiceL[i])
        self.notSpeaking = False
        self.speaker.startSpeakingString_(self.TF)
        pass
    
    def speechSynthesizer_didFinishSpeaking_(self,speechSyn,flag):
        NSLog("didFinishSpeaking_")
        self.notSpeaking = flag

Saturday, April 21, 2012

Damn you, autocorrect

Apparently, I'm not the only one who has problems with autocorrect.

At some point (Lion?), Safari acquired this ability. I keep seeing weird errors in my posts that I have no idea where they came from. So I turned it off system-wide under Language and Text Prefs. I also unchecked the box (which is checked in the screenshot) for replacing (c) with the symbol ©, etc.

Testing

It seems improbable, but it is a fact that on the first exam in some courses, a few students may score lower than expected for guessing at random. We usually see some scores of less than 20 out of 100 for an exam whose questions have five possible answers ABCDE.

We can turn this into an illustration of statistical significance. How high must a score be before one can dismiss the null hypothesis (no knowledge of the subject) with a p-value of 0.05?

We have a Bernouli trial with p = 0.2.

The expected value or mean is np. There is a beautifully simple derivation of the mean and variance in the wikipedia article, the variance is np(1-p). For our example, the mean is 20 and the variance is 100 x 0.2 x 0.8, so the standard deviation is √16 = 4.

For a large class (N = 100) of randomly answering students, the distribution appraches normal, and we expect that 2.5% of such students will score at or above two standard deviations above the mean. To make our rate for a type I error less than 0.05, we should require a score of 28 or higher.

I wrote a short script to simulate this. Each student takes a test with 100 questions, and each trial consists of a class of 100 students. We count the number of students with scores at or above a threshold T. The results for 10 trials are printed, and the results for 100 trials are averaged.

A few quick points: the mean appears to fall between 20 and 21. I'm sure there's a simple explanation but it eludes me at the moment. The score that 2.5% of students regularly exceed is 28.

This is a great example of the danger of multiple hypothesis testing!

One last thing is that the code is not very smart. What I should have done was extract the statistics for each value of T for each trial, rather than repeating the trial for each value of T. But it runs fast enough, and this way I don't need a data structure to remember the results.

> python script.py 
20:  56 43 54 57 57 54 57 46 45 64 55.0
21:  39 48 42 44 39 44 48 48 50 48 43.5
22:  43 32 39 30 38 35 36 42 33 41 34.6
23:  40 29 35 31 27 19 30 20 31 25 26.0
24:  19 18 24 26 20 22 21 17 15 24 19.2
25:  11 14  8  8 13 11 12 15 17 13 12.9
26:  12  6  6  6 11  4  8  7 10  6  9.2
27:   7  4  6  5  8  7  5  7  7  7  5.9
28:   3  2  4  2  3  1  2  3  5  1  3.6
29:   4  4  2  2  2  1  3  1  0  1  2.2
30:   1  1  1  0  1  1  1  0  0  1  0.9
31:   4  0  1  0  0  0  0  0  0  0  0.6
32:   0  0  0  0  1  0  0  1  0  0  0.3
33:   0  0  0  0  0  1  0  0  0  1  0.1
34:   0  0  0  0  0  0  0  0  1  0  0.1

script.py

import random
    
def student(N=100, p=0.2):
    L = [random.random() <= p for i in range(N)]
    return sum(L)

def trial(T):
    L = [student() for i in range(100)]
    return sum([1 for n in L if n >= T])
    
def mean(L):
    return 1.0*sum(L)/len(L)

for T in range(20,35):
    L = list()
    print '%2d: ' % T,
    for i in range(100):
        L.append(trial(T))
        if i < 10:
            print '%2d' % L[-1],
    print '%4.1f' % mean(L)

Friday, April 20, 2012

Hex puzzle [-1]

Here is a fourth (and final) post in the saga of the "Let us have hex" puzzle. (Actual puzzle, reddit comments, and previous posts here, here and here).

Although I know the redditeers have already solved the problem, I had hoped to work my way through it more systematically. But I ran into a constraint I can't get around very easily (or perhaps at all), which is that the available data is not complete. It's not enough to yield a systematic solution.

An important feature of the puzzle is that visitors to the puzzle page get served 100 digit hex strings that are refreshed to a new value slowly (maybe every half hour), and also might (at least theoretically) be restricted by ip address, so that a social venue like reddit is critical to gathering all the data. But the reddit faction quit too soon, after they'd guessed the answer.

The total number of strings I found in the reddit comments is 340. Let me show two demonstrations that the complete data is not there. First, recall that the base64 encodings have "codons" of length 4. So two adjacent 50-character base64 strings that are not correctly aligned (because they abut without overlap, and were decoded individually), might be missing as many as 4 base64 characters at the joint. This would affect in turn at most 2 adjacent characters in the C or PHP program, since the ratio is 4:3.

A simple decoding of string i = 6 gives:

< 73; i++) { pri
{ echo $a[$i];

The first line is C and the second PHP. The problem is that there's nothing in the known data that corresponds to what likely comes next---there's no "echo $b[$i];" in the data. It's just missing. I suppose there could be a series of very short overlaps that would build it, but it seems pretty unsporting. Similarly, the formatting of the C output is not there.

More conclusive evidence comes from an analysis of the solution. The first word is "Design" and it is derived from the C program, which has this fragment:

int a[] = {4, 4, 6, 5, 7, 3, 6, 9, 6, 7..

If we combine these as hex digits and then print the ASCII equivalent we get:

>>> L = ['44','65','73','69','67']
>>> L = ['0x' + s for s in L]
>>> L = [chr(int(s,base=16)) for s in L]
>>> L
['D', 'e', 's', 'i', 'g']

We need the "n" in design which is '0x6e' in hex:

>>> hex(ord('n'))
'0x6e'

But there is no 'e' in the data with a leading 6. The overlap is not there.

I think I'm done now. I had fun. I wrote a lot of code to try to deal with the challenge of how to assemble a bunch of strings where the candidate strings with a large overlap might be misleading. I was prepared to do the thing by hand, as shown in the screenshot. But there is not enough data, you have to guess. Here is what I was reduced to:

Sunday, April 15, 2012

Hex puzzle (3)

I've been working on the hex puzzle (previous posts here and here). We have a list of 340 unqiue lines of data, each 100 characters in length. The first and last lines have been given to us as i = 137 and i = 61, respectively.

The data file ('reddit_hex.txt') is on Dropbox here.

The data seem to be hexadecimal bytes with an unusual distribution that leads to the first part of the decoding: these are bytes that represent ASCII characters in the set A..Za..z0..1. This suggests the possibility that the given data are a hex representation of base64 (although the last two characters '/' and '+' were not found).

Unfortunately, I haven't yet been able to stitch the lines together in a way that generates a 1936 hex digit mesage, with the correct ends, to be decoded. Apparently I go wrong at some points in the assembly, probably because of repeats in the data stream, although perhaps I just messed up.

It might be a disappointment to you, but I'm going to try using another critical clue gleaned from the comments to the reddit post. After converting the base64 data to actual ASCII characters, it turns out that there are two intertwined messages: the alternating bytes form a C program and a PHP program. So I wrote a script that uses all these clues to analyze each of the lines of the original data.

Let's look at some sample output, say the first entry (i = 137):

> python demo.py 137
137 497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369
    I z x p P 2 5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i
    IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci
  0 ('#include <stdio.h>', '<?php $a = array(9')

What's going on here is that we have 100 hexadecimal digits or 50 bytes in the original data. Each byte can be converted to its ASCII representation as shown in the second and third lines.

>>> chr(int('0x49',base=16))
'I'

We can do the whole line at once with:

>>> import binascii
>>> s = '497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369'
>>> binascii.unhexlify(s)
'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci'

The base64 data decoding takes 4 characters at a time. For example:

>>> import base64
>>> d = base64.b64decode
>>> d('Izxp')
'#<i'
>>> s = 'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci'
>>> d(s[:-2])
'#<i?npchlpu d$ea  <=s tadriroa.yh(>9'

Because base64 is a 4-letter code, there are 4 possible frames, and we only know the frame for the first line of data. It turns out that for this data, for each line, only one of the frames yields just ASCII characters. In the script below, we filter out decodings like this one:

'\xd0\xb0\x80\xc8\xb1\x80\xc0\xb0\x80\xd8\xb0\x80\xc4\xb1\x80\xd8\xb0\x81'

by turning the whole line into an array of ints and then asking whether any is > 128.

def filter(line):
    B = [struct.unpack('B',b)[0] for b in line]
    return not max(B) > 128

After conversion from base64 to ASCII, we build strings from the alternate bytes, resulting in the output line:

0 ('#include <stdio.h>', '<?php $a = array(9')

This indicates that the two strings were derived from frame 0. It's obvious that the first one is the beginning of a C program, and the second certainly looks like PHP to me.

Note that we can't just use the data as is, because we have thrown away bytes at the joints. Decoding in frame 0, we lose the 'Ci' at the end of the base64 data.

Also, consider i = 8:

> python demo.py 8
  8 357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369787049473432
    5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i x p I G 4 2
  8 5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42
  2 ('clude <stdio.h>\nin', 'hp $a = array(9, 6')

Notice that 'clude <stdio.h>\nin' looks like a continuation of '#include <stdio.h>' from the first line.

>>> import base64
>>> d = base64.b64decode
>>> d('Y2hscHUg')
'chlpu '

The 'cl' of 'clude' are coming from the base64 'Y2hs'. It's reassuring that we can align 0 and 8:

IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci
      5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42

So the plan is, to stitch data lines together one at a time, and examine the result to be sure it makes sense after going through base64 decoding and de-interleaving.

demo.py

import sys, binascii, base64, struct

def load_data(fn):
    FH = open(fn,'r')
    data = FH.read().strip()
    FH.close()
    return data.strip().split('\n')

def untangle(s):
    s1 = ''.join([s[i] for i in range(0,len(s),2)])
    s2 = ''.join([s[i] for i in range(1,len(s),2)])
    return s1,s2

def filter(line):
    B = [struct.unpack('B',b)[0] for b in line]
    return not max(B) > 128

def process(s):
    d = base64.b64decode
    rL = list()
    for i,j in [(0,-2),(1,-1),(2,len(s)),(3,-3)]:
        line = d(s[i:j])
        if filter(line):
            rL.append((i,untangle(line)))
    return rL
    
def analyze_line(i,L):
    D = {'i':i}
    s = L[i]
    D['hex'] = s
    s = binascii.unhexlify(s)
    D['b64'] = s
    result = process(s)
    assert len(result) == 1
    j, t = result[0]
    D['frame'] = j
    D['t'] = t
    return D
    
def format_dict(D):
    s1 = '%3d ' % D['i'] + D['hex']
    s2 = '    ' + ' '.join(list(D['b64'])) 
    s3 = '%3d ' % D['i'] + D['b64']   
    s4 = '%3d ' % D['frame'] + str(D['t'])
    return '\n'.join([s1,s2,s3,s4])

L = load_data('reddit_hex.txt')
DL = [analyze_line(i,L) for i in range(len(L))]

try:
    i = int(sys.argv[1])
    text = format_dict(DL[i])
    print text, '\n'
except IndexError:
    for D in DL:
        text = format_dict(D)
        print text, '\n'

Tuesday, April 10, 2012

Hex puzzle (2)

Let's continue with the puzzle introduced in the last post (here).

The next step is to stitch the lines together to form the complete text. If they contained random bytes, that would be relatively easy. There are 65536 possible two-byte combinations. We have a total of 340 * 25 = 8500 two-byte (4 hex character) combinations in our data, so the expected number of times any combination should appear in random data is much less than 1.

However, the data is not random, instead, it is highly skewed. Simple modifications to count.py (from last time) show a total of only 370 two-byte combinations in the data (i.e. 65166 of all possible values are not observed), and the top 20 have counts ranging from 464 down to 72.

464  4341  CA
 331  4943  IC
 309  4167  Ag
 281  4c43  LC
 234  734c  sL
 174  4173  As
 154  4377  Cw
 140  7773  ws
 139  7349  sI
 139  674d  gM
 139  674c  gL
 137  4c44  LD
 126  674e  gN
 104  6749  gI
 103  7767  wg
  95  4e69  Ni
  89  5973  Ys
  78  4944  ID
  76  4947  IG
  72  4e6a  Nj

In the table above, the first column is the count, the second and third are the hex and ASCII representations.

That compares with a maximum count of 3-4 for random data, and an observed total of 8000 different values on an average run with random data.

This suggests that we should undertake some data exploration to see whether the repetitive character of the data will cause difficulties for the assembly.

The core operation is to take an individual entry at index i in the data (the query), and depending on a parameter (tlen) for the number of characters to choose at the very end of the query, make a substring sequence t, the target to search for. Then, we go through all the other entries looking for a match containing t, and when one is found we save the index of the query and the match as well as the position k where the target t begins in the match string.

The quality of these matches can be assessed in various ways. We expect that the sequence of each match string upstream of the target should be exactly the same as the query. We calculate the offset o where the beginning of the match string should be found in the query. Since this depends on tlen (o = len(L[i]) - k - n) we calculate it for each hit and save it with the other parameters. The offset is also useful for printing the actual alignments in show.

A second quality test is to examine the match strings downstream of the target. In the region of overlap, we expect they should be identical as well. A failure of either test would suggest that the data is causing trouble for our simple search algorithm.

As mentioned last time, the observed hex digits comprise all (and only) those which encode ASCII characters 0..1A..Za..z. Rather than print out strings of 100 characters, it will simplify the output to convert the data to ASCII first. We're assuming that this is what we need to do with the data, but I peeked at the answer on reddit, so it's safe to go ahead.

python convert.py > data.mod.txt

import binascii
from utils import load_data

L = load_data('reddit.txt')
for line in L:
    print binascii.unhexlify(line)

Note the extremely repetitive nature of the data by comparing some adjacent output lines:

A1NiAsNiAsMyAsMSA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx
A1NywsICAyNSwsICA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx

AgLDUgLDcgLDMgLDAgLDAgLDcgLDIgLDIgLDYgLDYgLDEgNTQs
AgLDUgLDcgLDMgLDAgLDAgLDcgLDYgLDIgLDYgLDYgLDEgNTQs

xpIG42dCwgIG00YSxpIG42KCwpICAzeywgICAyICwgICA2ICxp
xpIG43dCwgIGE3WyxdICA3PSwgIHswNCwsICA2NCwsICAxNiws

However, for a well-chosen query (one with a rare two-byte combination as the target) and a long enough target length, it's easy to find matches that appear correct. In the output below, we examine the string found at index 0 in the data as the query. The analysis finds 25 matches, data strings where the terminal 4 characters appear somewhere in the match, and all of those matches are perfect ones whether looking upstream and comparing against the query, or looking downstream in each one (beyond the target) and comparing with each other, as shown in the second half of the output.

We'll exercise these routines a bit more next time. I want to see how big to make the target size, and also whether we should exclude some strings from the preliminary analysis. But overall, it looks like we might be OK.

> python search.py
i= 0 : 25 matches found
0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                                              GExc

 NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
  iwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
      CAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
       AyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
           xmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
            mICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
              CcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
               cwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                wLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                 LCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                  CwgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                   wgIDc2KSw7ICAxJCxiICA2PSwgIGExc
                    gIDc2KSw7ICAxJCxiICA2PSwgIGExc
                              CAxJCxiICA2PSwgIGExc
                               AxJCxiICA2PSwgIGExc
                                xJCxiICA2PSwgIGExc
                                 JCxiICA2PSwgIGExc
                                  CxiICA2PSwgIGExc
                                   xiICA2PSwgIGExc
                                       A2PSwgIGExc
                                        2PSwgIGExc
                                          SwgIGExc
                                            gIGExc
                                             IGExc
                                              GExc
no mismatches upstream

GExc
    j
    jR
    jRyLGE
    jRyLGEg
    jRyLGEgeTYo
    jRyLGEgeTYoL
    jRyLGEgeTYoLDc
    jRyLGEgeTYoLDcg
    jRyLGEgeTYoLDcgL
    jRyLGEgeTYoLDcgLD
    jRyLGEgeTYoLDcgLDQ
    jRyLGEgeTYoLDcgLDQg
    jRyLGEgeTYoLDcgLDQgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDY
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLD
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDA
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUg
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDc
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgL
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLD
    jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLDM
no mismatches downstream

search.py

from utils import load_data

# tlen is minimum overlap
# functions deal with one string at a time
# do not assume upstream matches
# e is search string, f is potential match

def one_search(i,L,tlen):
    # search result has associated tlen
    oL = list()
    e = L[i]
    t = e[-tlen:]
    for j,f in enumerate(L):
        if i == j or not t in f:
            continue
        # offset is pos in e where f should start
        # not guaranteed to match except over t
        offset = len(e) - f.index(t) - tlen
        oL.append((j,offset))
    oL.sort(key=lambda item:item[1])
    return oL

# mmL is a list of mismatched strings
# ummL contains upstream mismatches
def test_upstream(i,L,oL,ummL):
    e = L[i]
    for j,o in oL:
        f = L[j]
        if not f.startswith(e[o:]):
            if i < j:
                ummL.append((i,j))
            else:
                ummL.append((j,i))
    
def trim_downstream(oL,L,tlen):
    pL = list()
    for j,o in oL:
        f = L[j]
        k = len(f) - o
        pL.append((j,f[k:]))
    return pL
    
def test_downstream(pL,dmmL):
    mismatches = 0
    for i,s1 in pL:
        for j, s2 in pL:
            if s1 == s2 or len(s1) < len(s2):
                continue
            if not s1.startswith(s2):
                if i < j:
                    dmmL.append((i,j))
                else:
                    dmmL.append((j,i))
    
if __name__ == '__main__':             
    L = load_data('data.mod.txt')
    tlen = 4
    i = 0
    e = L[i]
    oL = one_search(i,L,tlen)
    
    print 'i=', i, ':', len(oL), 'matches found'
    print e
    print ' ' * (len(e)-tlen) + e[-tlen:]
    print
    for j,o in oL:
        k = len(e) - o
        print ' '*o + L[j][:k]
        
    ummL = list()
    test_upstream(i,L,oL,ummL)
    if ummL:
        print len(ummL), 'mismatches'
    else:
        print 'no mismatches upstream\n'

    dmmL = list()
    print e[-tlen:]
    pL = trim_downstream(oL,L,tlen)
    for item in pL:
        print ' '*tlen + item[1]
    test_downstream(pL,dmmL)
    if dmmL:
        print len(dmmL), 'mismatches'
    else:
        print 'no mismatches downstream'

Monday, April 9, 2012

Hex puzzle (1)

Last Saturday, an interesting puzzle was posted on reddit/r/programming. I'm having trouble finding the original post, but the actual puzzle page is here, and the (complete) comments to the post are here.

As you can see from the screenshot or the link

the puzzle is a message consisting of 1936 hex digits, and upon further manipulation it promises to yield the answer. You can actually follow the link to see the answer if you're impatient. I wasn't able to compete (I had a party to cook for, and could not have gone as fast as these guys anyway), but I'd still like to understand how they got there.

Apparently, we need to assemble a number of overlapping 100 digit pieces of a 1936 digit message and then figure out what it means. The pieces were served up to different visitors to the site and collected as comments on reddit. The first objective is to recover the observed data. I believe reddit must have an API for this, but I don't know it. What I did was to save the source for the page with all 410 comments (in the file comments.txt) and then run scrape.py below.

I count 251 strings of 100 characters, all in "0..9a..f".

According to the instructions, the first and last strings are marked, and these were recovered and posted on reddit:

497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369 //first 100 4270494630674f3349675a5831304948566c636d4e75614342764d4341374a79426c494363674f7941674944393950676f3d //last 100

[UPDATE: There is a link to the complete (known) data from reddit to github here. I count a total of 340 strings, substantially more than what I've been working with in early attempts. This should help with the assembly problem. ]

The data clearly has substantial structure, as indicated by the screenshot from a "find" search for the digit "4" in TextEdit.

Since only hexadecimal digits are present it suggests that this might actually be hexadecimal. It's interesting that not all digits are equally represented. In particular, only those bytes (two hex digits) which encode ASCII "a..zA..Z0..9" are present (although the distribution is decidedly odd). So it seems pretty likely that we just need to decode the hex to ASCII and then figure out what it means.

However, there is an assembly problem that comes first.

Output from count.py (listing at the end):

> python count.py 
  54  30   48  0  0
  40  31   49  1  1
 256  32   50  2  2
  71  33   51  3  3
  58  34   52  4  4
  82  35   53  5  5
  64  37   55  7  7
  39  38   56  8  8
  63  39   57  9  9
1070  41   65  A  A
 133  42   66  B  B
1102  43   67  C  C
 414  44   68  D  D
 111  45   69  E  E
  38  46   70  F  F
 169  47   71  G  G
 101  48   72  H  H
 812  49   73  I  I
 151  4a   74  J  J
 132  4b   75  K  K
 667  4c   76  L  L
 399  4d   77  M  M
 532  4e   78  N  N
  83  4f   79  O  O
  82  50   80  P  P
  67  51   81  Q  Q
  65  52   82  R  R
 286  53   83  S  S
  58  54   84  T  T
  51  55   85  U  U
  12  56   86  V  V
  56  57   87  W  W
  31  58   88  X  X
 315  59   89  Y  Y
  92  5a   90  Z  Z
  16  61   97  a  a
  13  62   98  b  b
 278  63   99  c  c
  61  64  100  d  d
  58  65  101  e  e
   8  66  102  f  f
 938  67  103  g  g
  57  68  104  h  h
 247  69  105  i  i
 201  6a  106  j  j
 116  6b  107  k  k
  62  6c  108  l  l
  64  6d  109  m  m
  97  6e  110  n  n
  61  6f  111  o  o
 209  70  112  p  p
 896  73  115  s  s
  25  74  116  t  t
  23  75  117  u  u
  15  76  118  v  v
 532  77  119  w  w
 211  78  120  x  x
 414  79  121  y  y
 192  7a  122  z  z

scrape.py

# python scrape.py > data.txt
fn = 'comments.txt'
FH = open(fn)
data = FH.read()
FH.close()

L = list()
i = 0

while True:
    j = data.find('>',i)
    j += 1
    i=j
    k = j + 100
    if k > len(data):
        break
    if not data[k] == '<':
        continue
    w = data[j:k]
    if ' ' in w or '>' in w or '<' in w:  
        continue
    for c in w:
        assert c in '0123456789abcdef'
    L.append(w)
    if i == 0:
        break

L = list(set(L))
L.sort()

for w in L:
    print w

count.py

import binascii
from collections import Counter

def load_data():
    fn = 'data.txt'
    FH = open(fn,'r')
    data = FH.read().strip()
    FH.close()
    L = data.split('\n')
    return L

L = load_data()
n = 2
cL = list()
for line in L:
    for i in range(0,len(line)-n+1,2):
        cL.append(line[i:i+n])

C = Counter(cL)
for h,n in sorted(C.most_common()):
    i = int(h, base=16)
    print '%4d  %s %4d  %s  %s' %\
    (n, h, i, chr(i), binascii.unhexlify(h))