## Monday, April 30, 2012

### KS test (2)

A bit more about the Kolmogorov-Smirnov test after the first post here. I found a pdf at MIT ocw about this. I don't understand much of it but it's clear that using the absolute value of the difference to get the statistic is correct.

Now we need to figure out the distribution of the statistic under the null hypothesis, so we can set our type I error at 0.05. From my perspective, the way to do this is to do a simulation in Python. The code below does the following N times: grab n=20 numbers from the standard normal and sort them, compute the D statistic as we did in the previous post, and accumulate those values in a list.

We sort the list, then find the place in the list where the statistic for the test case is found. (I used a function from the bisect module for this, but the naive method gives the same result).

I get a p-value of 0.25. This does not match what we had before (0.36), and I am not sure why that is.

To do a little more checking, I found an online reference that gives p-values for various statistic,n combinations.

for n = 20

 ```p-value D 0.20 0.210 0.15 0.246 0.10 0.264 0.05 0.294 0.01 0.356```

According to this, a statistic of 0.20 (what we had) gives a p-value a bit larger than 0.20, but not much. So this source doesn't agree with what we got from R and Python. Maybe the online source is for a one-sided test (it doesn't state clearly), that would give a smaller p-value for a given statistic.

I added calculations for our simulation of d given various p, and vice-versa. The output:

 ```> python script.py d = 0.199 p = 0.25 d = 0.246 p = 0.10 d = 0.264 p = 0.06 d = 0.294 p = 0.03 p = 0.15 d = 0.226 p = 0.10 d = 0.245 p = 0.05 d = 0.275 p = 0.01 d = 0.336```

Not sure what the correct answer is here, but I can't find a problem with the code and I feel like it's time to move on. One last bit is something to think about that is quoted here:

The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple, and doesn't do a good job of discriminating whether or not your data was sampled from a Gaussian distribution. An expert on normality tests, R.B. D’Agostino, makes a very strong statement: “The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used.” (“Tests for Normal Distribution” in Goodness-of-fit Techniques, Marcel Decker, 1986).

Take that, Andrey!

 ```from bisect import bisect_left as bl from scipy.stats import norm import matplotlib.pyplot as plt import numpy as np # normal cdf rv = norm() def calc_stat(xL,yL): xL.sort() dL = list() for x,y0 in zip(xL,yL): y1 = rv.cdf(x) dL.append(abs(y1-y0)) return max(dL) def p_value(d,L): #for i,e in enumerate(L): #if d < e: #break i = bl(L,d) p = 1 - ((i+1)*1.0/len(L)) return p #----------------------------------- # simulation N = 10000 n = 20 yL = range(1,n+1) yL = [1.0*y/n for y in yL] #replaces "steps" pL = list() for i in range(N): xL = np.random.normal(0,1,n) D = calc_stat(xL,yL) pL.append(D) plt.hist(pL,bins=50) plt.savefig('example.png') #----------------------------------- pL.sort() for d in [0.199, 0.246, 0.264, 0.294]: p = p_value(d,pL) print 'd = %3.3f' % d, print 'p = %3.2f' % p print for p in [0.15, 0.10, 0.05, 0.01]: i = int(N*(1-p)) d = pL[i] print 'p = %3.2f' % p, print 'd = %3.3f' % d ```

### KS test

I ran into a question about using the Kolmogorov-Smirnov test for goodness-of-fit in R and Python. I'd heard of the KS test but didn't know what it was about so I decided to explore.

Wikipedia's explanation was opaque at first (not surprising), but I found nice ones here and here.

As the references say, the KS test can be used to decide if a sample comes from a population with a specific distribution. We calculate a statistic D, and if the statistic is large enough we reject the null hypothesis that the sample comes from the specified distribution.

What is the D statistic? I think it is just the maximum deviation between the observed cdf for the sample, compared to the distribution. That works for the script below.

The KS test is a non-parametric test, it doesn't depend on assumptions about the distribution (e.g. standard deviation) of the data.

For these tests, I needed to reinstall scipy. The previous difficulties were not observed this time. It was as simple as:

 ```git clone git://github.com/scipy/scipy cd scipy python setup.py build sudo python setup.py install```

The script below grabs 20 numbers from the normal distribution (sd = 2.0) and compares them with the standard normal distribution (sd = 1). The script does a plot (screenshot above) and prints the results of the KS test:

 ```> python script.py max 0.198 D = 0.20, p-value = 0.36```

We are not able to reject the null hypothesis, even though the source of the numbers had sd=2 and the null hypothesis is that sd=1.

I did a quick check by copying the numbers

`x=c(-5.05,-2.17,-1.66,-1.35,-0.98,-0.97,-0.84,-0.74,-0.67,-0.00,0.21,0.24,0.33,0.67,0.94,1.63,1.65,2.05,2.22,3.19)`

and pasting them into R. Then:

 ```> ks.test(x,pnorm,0,1) One-sample Kolmogorov-Smirnov test data: x D = 0.1986, p-value = 0.3611 alternative hypothesis: two-sided```

Which matches. I also wrote a slightly hackish test of the test in R:

 ```f <- function () { c = 0 for (i in 1:100) { y1 = rnorm(n,mean=0,sd=1) result = ks.test(y1,"pnorm",0,1) if (result\$p.value < 0.05) { print (result\$statistic) print (result\$p.value) } } } ```
This function spews out the values of the parameters if the p-value is < 0.05. We're testing against the same distribution from which the samples were drawn (i.e. the null hypothesis is correct), so we expect that about 5 of the 100 tries will have a test statistic so extreme that we're fooled into thinking the null hypothesis is wrong. Output:
 ```> n = 10 > f() D 0.454372 [1] 0.020945 D 0.4113619 [1] 0.04811348 D 0.4722679 [1] 0.01440302 D 0.43666 [1] 0.02984022 D 0.5340613 [1] 0.003428069 D 0.4336947 [1] 0.03161219```

Looks good to me.

`script.py`
 ```import matplotlib.pyplot as plt from scipy.stats import uniform, norm, kstest import numpy as np np.random.seed(1773) def steps(L): delta = 1.0/len(L) y = 0 rL = list() for x in L: y += delta rL.append(y) return rL xL = np.random.normal(0,2.0,20) xL.sort() yL = steps(xL) #---------------------------------- # points plt.scatter(xL, yL, s=50, zorder=1) # distribution rv = norm() pL = np.linspace(-3, 3) plt.plot(pL, rv.cdf(pL), '--') # vertical lines dL = list() for x,y0 in zip(xL,yL): y1 = rv.cdf(x) dL.append(abs(y1-y0)) plt.plot((x,x),(y0,y1), color='r', zorder=0) ax = plt.axes() ax.set_xlim(-4, 4) ax.set_ylim(-0.1, 1.1) plt.savefig('example.png') print 'max %3.3f' % max(dL) t = kstest(xL,'norm') print 'D = %3.2f, p-value = %3.2f' % t xL = ['%3.2f' % n for n in xL] print ','.join(xL) ```

## Saturday, April 28, 2012

### Build R

Just for fun, I decided to see if I could build it from source. I ran into only one hiccup, related to gfortran. The gfortran compiler is a new one I got the other day to build Julia.

 ```> gfortran -v .. Target: x86_64-apple-darwin11 .. gcc version 4.6.2 20111019 (prerelease) (GCC) ```

Here are the build instructions for R:

 ```# download R curl -O http://cran.r-project.org/src/base/R-2/R-2.15.0.tar.gz # unpack R tar fxz R-2.15.0.tar.gz # choose the architecture (x86_64, i386, ppc or ppc64) arch=x86_64 # create a build directory mkdir R-\$arch cd R-\$arch # configure R ../R-2.15.0/configure r_arch=\$arch CC="gcc -arch \$arch" \ CXX="g++ -arch \$arch" F77="gfortran -arch \$arch" \ FC="gfortran -arch \$arch" OBJC="gcc -arch \$arch" \ --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib \ --with-blas='-framework vecLib' --with-lapack # build R make -j8 # At this point you can run R as `bin/R' even without installing # install R make install # in some cases you may need to use ``sudo make install'' instead```

The configure step fails with a complaint about the options to gfortran. The config.log file reads:

 ```.. configure:7007: result: defining F77 to be gfortran -arch x86_64 configure:7179: checking for Fortran 77 compiler version configure:7188: gfortran -arch x86_64 --version >&5 GNU Fortran (GCC) 4.6.2 20111019 (prerelease) Copyright (C) 2011 Free Software Foundation, Inc. GNU Fortran comes with NO WARRANTY, to the extent permitted by law. You may redistribute copies of GNU Fortran under the terms of the GNU General Public License. For more information about these matters, see the file named COPYING configure:7199: \$? = 0 configure:7188: gfortran -arch x86_64 -v >&5 Driving: gfortran -mmacosx-version-min=10.7.3 -arch x86_64 -v -l gfortran -shared-libgcc gfortran: error: x86_64: No such file or directory Using built-in specs. COLLECT_GCC=gfortran COLLECT_LTO_WRAPPER=/usr/local/gfortran/libexec/gcc/x86_64-apple-darwin11/4.6.2/lto-wrapper gfortran: error: unrecognized option '-arch'```

Removing `"-arch \$arch"` from F77 and FC above, fixed it. R builds pretty quickly, then this works:

It even gives a Quartz window from which I can save the figure.

I don't have any real reason to do this, just thought I'd try it. Also, an important question is whether shared libraries like `/usr/X11/lib/libpng.dylib` are linked in properly. This works (but I'm not sure where it's getting libpng from):

 ```> png('~/Desktop/x.png') > plot(1:10) > dev.off() null device 1 > ```

## Friday, April 27, 2012

### Python mergesort

I need to make a serious commitment to algorithms at some point. I have the 3rd edition of Sedgewick, and worked through volume 1 some years ago. I see the 4th edition has moved to Java; I think that's a big mistake. I would hope for a basic algorithms book in Python, but I'll take C++ over Java any day. (I also have Cormen but put it aside).

I should really buy Knuth's books, but worry they'd be beyond my abilities.

I also have Magnus Hetland's book and Zelle's also, and I think they're great, but haven't found the time to be systematic about either of them either. They are for this summer.

Just for fun, here is a modified version of mergesort (from my book).

There are a couple of ideas here. First, suppose we have two lists that are already sorted. It's easy to merge them into one sorted list, by comparing the next elements in each and choosing the correct one. That's what merge does. The second idea is recursion. That's what mergesort does. The code below is set up to read a power of ten from the input.

This is a good challenge for a beginning Python coder.

Here is some timing on mine:

 ```> time python merge.py 3 real 0m0.054s user 0m0.039s sys 0m0.013s > time python merge.py 4 real 0m0.232s user 0m0.216s sys 0m0.014s > time python merge.py 5 real 0m8.201s user 0m8.178s sys 0m0.020s > ```
It hangs with N = 1e6, I am not sure why yet.

merge.py
 ```import random, sys def merge(left,right): rL = list() while True: if left and right: if left[0] < right[0]: rL.append(left.pop(0)) else: rL.append(right.pop(0)) continue if left: rL.extend(left) elif right: rL.extend(right) return rL def merge_sort(iL): if len(iL) == 1: return iL n = len(iL)/2 left = merge_sort(iL[:n]) right = merge_sort(iL[n:]) result = merge(left,right) return result if __name__ == '__main__': try: n = int(sys.argv[1]) except: n = 4 N = 10**n L = range(N) random.shuffle(L) result = merge_sort(L) assert result == sorted(L) ```

### UTF-16 endian-ness

This is a brief continuation of the post on hexdump. I saved the same data in different files as UTF-16 in both big-endian and little-endian formats using TextMate.

 ```> hexdump -C text16B.txt 00000000 fe ff 00 54 00 68 00 69 00 73 00 20 00 69 00 73 |...T.h.i.s. .i.s| 00000010 00 20 00 73 00 6f 00 6d 00 65 00 20 00 74 00 65 |. .s.o.m.e. .t.e| 00000020 00 78 00 74 00 0a 00 2d 00 2d 00 e4 00 2d 00 2d |.x.t...-.-...-.-| 00000030 00 3a 00 20 00 20 00 61 00 20 00 77 00 69 00 74 |.:. . .a. .w.i.t| 00000040 00 68 00 20 00 75 00 6d 00 6c 00 61 00 75 00 74 |.h. .u.m.l.a.u.t| 00000050 00 0a 00 2d 00 2d 22 1a 00 2d 00 2d 00 3a 00 20 |...-.-"..-.-.:. | 00000060 00 20 00 73 00 71 00 72 00 74 |. .s.q.r.t| 0000006a > hexdump -C text16L.txt 00000000 ff fe 54 00 68 00 69 00 73 00 20 00 69 00 73 00 |..T.h.i.s. .i.s.| 00000010 20 00 73 00 6f 00 6d 00 65 00 20 00 74 00 65 00 | .s.o.m.e. .t.e.| 00000020 78 00 74 00 0a 00 2d 00 2d 00 e4 00 2d 00 2d 00 |x.t...-.-...-.-.| 00000030 3a 00 20 00 20 00 61 00 20 00 77 00 69 00 74 00 |:. . .a. .w.i.t.| 00000040 68 00 20 00 75 00 6d 00 6c 00 61 00 75 00 74 00 |h. .u.m.l.a.u.t.| 00000050 0a 00 2d 00 2d 00 1a 22 2d 00 2d 00 3a 00 20 00 |..-.-.."-.-.:. .| 00000060 20 00 73 00 71 00 72 00 74 00 | .s.q.r.t.| 0000006a```

The first two bytes are a byte order mark or BOM. You can see how that works.

Notice the change in our example code points from last time. The ä ('LATIN SMALL LETTER A WITH DIAERESIS'), which is c3 a4 in UTF-8, is 00 e4 in big-endian UTF-16 (and the reverse in little-endian)

The √ is e2 88 9a in UTF-8, and is 22 1a in big-endian UTF-16 and reversed as well in little-endian.

The most important point is that the bytes are completely different depending on the encoding (UTF-8 versus UTF-16), and a difference of encoding is usually the source of weird or unexpected stuff you see printed on the screen, when the encoding and decoding don't match.

I wish they had come up with some other words to describe the byte order than big- and little-endian. The reference to Gulliver is amusing, but a string of bytes has two ends. Which is supposed to be the big one? It's similar to the situation with genes, where people talk about the 5' and 3' ends, but of course each end of double-stranded DNA has both 5' and 3' ends!

Big-endian is described as the "most significant byte first." How about "highest value byte first" or "natural order" ?

### hexdump

This is a quick look at hexdump, with a brief discussion of UTF encodings.

This is some text
--ä--: a with umlaut
--√--: sqrt

Paste that into a text editor (like TextMate) and save in UTF-8 encoding (default).

 ```> hexdump -C text8.txt 00000000 54 68 69 73 20 69 73 20 73 6f 6d 65 20 74 65 78 |This is some tex| 00000010 74 0a 2d 2d c3 a4 2d 2d 3a 20 20 61 20 77 69 74 |t.--..--: a wit| 00000020 68 20 75 6d 6c 61 75 74 0a 2d 2d e2 88 9a 2d 2d |h umlaut.--...--| 00000030 3a 20 20 73 71 72 74 |: sqrt| 00000037```

[ Two other very useful flags are -n (number of bytes to read) and -s (offset, number of bytes to skip. Flags must preceed the filename. ]

The middle two columns are the bytes, and the last is the ASCII version for each, which we requested using the -C flag. Values at or above 128 (hex '0x80'), and I think the unprintable low-value ones, are shown as a "."

The output gives us a chance to practice hex arithmetic. For example, the third byte is '0x69' which is "i" and the fourth byte is '0x73' which is "s". That's because the hex characters between these two are:

 ```'0x6a' '0x6b' '0x6c' '0x6d' '0x6e' '0x6f' '0x70' '0x71' '0x72' "j" "k" "l" "m" "n" "o" "p" "q" "r"```

The a with umlaut is also called 'LATIN SMALL LETTER A WITH DIAERESIS' (reference). There is a nice Python doc about Unicode, also wikipedia, where you can learn about the saga of Klingon.

The UTF-8 encoding of 'LATIN SMALL LETTER A WITH DIAERESIS' is

'0xc3' '0xa4'

and you can see these two bytes in the second line of the printout flanked by the repeated bytes '0x2d' for "--"

The way this works is that UTF-8 is a variable length encoding:

 ```0zzzzzzz one-byte character 110yyyyy 10zzzzzz two-byte character 1110xxxx 10yyyyyy 10zzzzzz three-byte character```

An ASCII text and its UTF-8 encoding would be identical. Let's grab the bits:

 ```>>> bin(0xc3) '0b11000011' >>> bin(0xa4) '0b10100100'```

In bytes then, '0xc3' '0xa4' is:

11000011 10100100

Remove the "meta" bits (not sure what they're actual name is at the moment), leaving:

00011 100100

00000000 11100100

which is:

00 e4

The two bytes above are the encoding in UTF-16, which we could just have obtained from the reference page.

This format is what's called big-endian.

The second non-ASCII character we have is the square-root symbol √ (encoded in this HTML page as ampersand-radic;)

In the output we have:

e2 88 9a

 ```>>> bin(0xe2) '0b11100010' >>> bin(0x88) '0b10001000' >>> bin(0x9a) '0b10011010'```

In binary, that is:

11100010 10001000 10011010

Stripping out the meta info:

0010 001000 011010

00100010 00011010

In hex:

22 1a

which matches the reference.

I'll have to put off discussion of UTF encodings in Python to another time.

## Thursday, April 26, 2012

### Basic find

This is to remind me of how to use the Unix command find. Possibly only I will find this useful. I'm just "indexing my head."

Make a test directory with one file:

 ```> cd ~/Desktop > mkdir x > touch x/y.txt > cd ```

Basic usage for find is: find dirname -name filename

 ```> find . -name y.txt ./Desktop/x/y.txt >```

Don't forget the -name. Without it, find thinks file_name is just another place to search, since find can take multiple directories to search. "." is, of course, the current directory (I'm in my home directory "~"), while "/" would be the root of the file system.

Another use would be to count the number of files with wc:

 ```> find . | wc -l 158479 >```

Of course, find can use wildcards.

 ```> find ~/Desktop -name \*.txt /Users/telliott/Desktop/notes.txt /Users/telliott/Desktop/todo.txt /Users/telliott/Desktop/x/y.txt```

Notice the backslash to escape: \*

[ UPDATE: Can also be done with quotes "*.txt" ]

The above will also find hidden files. Other useful options for find include:

-type
-mtime
-mmin

For example, find all recently changed files:

 ```> find ~/Desktop -mmin -30 /Users/telliott/Desktop /Users/telliott/Desktop/.DS_Store /Users/telliott/Desktop/.y.txt /Users/telliott/Desktop/x /Users/telliott/Desktop/x/y.txt```

Do this with ~ to find where TextEdit saves all its data about the current state!

With appropriate caution, you might pipe the results to xargs to run a command:

 ```> cd Desktop/ > find . -name \*.txt ./.y.txt ./notes.txt ./Rob Pike.txt ./todo.txt ./x/y.txt > find . -name \*.txt | xargs /bin/ls -al ls: ./Rob: No such file or directory ls: Pike.txt: No such file or directory -rw-r--r-- 1 telliott staff 0 Apr 26 06:23 ./.y.txt -rw-r--r--@ 1 telliott staff 29 Apr 23 15:54 ./notes.txt -rw-r--r--@ 1 telliott staff 815 Apr 24 17:19 ./todo.txt -rw-r--r-- 1 telliott staff 0 Apr 26 06:00 ./x/y.txt```

Notice what happened to the filename that contains a space. I have to read more to know how to fix that.

If you're searching the system without read permission for directories you'll get error messages:

 ```> find / -name foo find: /.DocumentRevisions-V100: Permission denied find: /.fseventsd: Permission denied find: /.Spotlight-V100: Permission denied find: /.Trashes: Permission denied```

They can be discarded like this:

 `> find / -name foo 2>/dev/null`

This redirects stderr (the "2") to /dev/null, which is a Unix black hole.

Another sort of listing is obtained with du-disk usage, piped to sort and then to tail.

 ```> du -sc * | sort -n | tail 1590728 My Downloads 3188512 Software 3392024 Dropbox 8421800 VirtualBox VMs 8509928 Library 8937696 Teaching 19995376 Pictures-TE 43919360 Music 236592936 Movies 336685360 total```

Learn Unix in 10 minutes (quick reference)
A nice reference for find (and other stuff).

## Wednesday, April 25, 2012

### Julia

Julia is a new "high-level, high-performance dynamic programming language for technical computing." Website here and rationale here. Interview with Stefan Karpinski.

I followed the instructions here to download and build it. The only issue is that I failed to read the platform-specific notes before starting, so I had the wrong (old) gfortran. There are links in the docs, but I followed the notes in this thread and got it from (download link).

[ UPDATE: re the first comment, some discussion here. ]

## Tuesday, April 24, 2012

### big brother

As I mentioned in a previous post, I "upgraded" to iCloud from MobileMe basically in order to keep my .mac email address (which is actually now a .me address, but it forwards, or something). I'm not happy about iDisk going away, since I linked to so much content here on the blog, but that's the way it's going to be. I'm not going to fix all those links by hand to point to a new server. I still have the files, if anyone wants something.

I use Dropbox, and I love it. It's simple to choose which directories and files are shared, and I share everyday between work and home. I've also been grateful for the ability to roll back changes and recover deleted files, and also to share files in my Public folder.

Now comes iCloud.

How do you share selected folders to the cloud? As far as I know, you can't. There is a method explained by a Macworld editor in a post titled "Make Your Own Personal Dropbox With iCloud", and at the end he says:

Apple says that using this folder in the manner described above may lead to data loss
That doesn't sound like something I want to try. And then of course there's this:

I checked the box for "Documents & Data", and then unchecked it, to find this ominous alert. Note there's no way to say, please don't erase all that shared data from my Mac. Big brother knows best. I really have no explanation for it, but Apple have become even bigger a***oles than Microsoft were at their peak.

So it's a very welcome thing to see Google Drive announced today (article).

They would never erase my data for some stupid DRM-related reason. Would they?

### defaultdict

A common use case for dictionaries is to hold a count of the occurrences of a key in some other object, for example, words in a file. In this case, one needs to deal with missing keys:

 ```>>> L = list('abcaba') >>> D = dict() >>> for k in L: ... try: ... D[k] += 1 ... except KeyError: ... D[k] = 1 ... >>> D {'a': 3, 'c': 1, 'b': 2}```

Another solution is to use the setdefault method of dicts:

 ```>>> D = dict() >>> for k in L: ... D.setdefault(k,0) ... D[k] += 1 ... 0 0 0 1 1 2 >>> D {'a': 3, 'c': 1, 'b': 2}```

A better way (certainly better than this second approach) is to use a defaultdict:

 ```>>> from collections import defaultdict >>> D = defaultdict(int) >>> for k in L: ... D[k] += 1 ... >>> D defaultdict(, {'a': 3, 'c': 1, 'b': 2})```

Some of the magic involves __missing__:

 ```>>> 'd' in D False >>> D.__missing__('d') 0 >>> 'd' in D True >>> D defaultdict(, {'a': 3, 'c': 1, 'b': 2, 'd': 0})```

For more you can go to the docs or here or here, or just do this:

 ```>>> print defaultdict.__missing__.__doc__ __missing__(key) # Called by __getitem__ for missing key; pseudo-code: if self.default_factory is None: raise KeyError((key,)) self[key] = value = self.default_factory() return value ```

[ UPDATE: The collections.Counter class was mentioned in a comment. Don't know why I didn't mention it in this post, I've used it in many other posts here. If it's convenient to convert the data to a list of keys, that's the way to go. ]

### A different use of cat

 ```> cat > hello.py << EOF > print "Hello, world!" > EOF > python hello.py Hello, world! > more hello.py print "Hello, world!" > ```

How does this work? According to the man page for cat, if

file is a single dash (`-') or absent, cat reads from the standard input.
The ">" redirects to a file as usual.

The "<<" is special. The whole construct is called a "here document." I found more about it including its proper name here, which leads to a wikipedia entry.

Note that any symbol not present by itself on a line in the desired text will work for termination. Here's a rather silly sequence in which it took me a while to enter the single character 'X' to stop input:

 ```> cat > x.txt << X > x > cat > x.txt << XY > cat > x.txt << XYZ > print me > XYZ > cat > x.txt << X > hi > X > more x.txt x cat > x.txt << XY cat > x.txt << XYZ print me XYZ cat > x.txt << X hi > ```

Quick Unix: a brief look at Unix stuff that I find interesting and not too obvious.

## Monday, April 23, 2012

### mtree on OS X

I came across a utility to test for changes in files (and more stuff, like permissions) in Chapter 4 of EdgeEnterprise Mac Security: mtree.

I'd never heard of it. There's a writeup about it here, as well as a couple of blog entries (here and here)

Rather than go straight to mtree (I still have to read the man page), this is a write-up of the bash script posted on the the first blog post. So it's really more about improving my bash scripting-fu (origin here) than it is about mtree.

 ```> nl script.sh 1 #!/bin/sh 2 SCRIPT="`basename \$0 | sed 's/\..*\$//'`" 3 CONFIG=\${1:-\$HOME/.\$SCRIPT} 4 DIR=\${2:-\$HOME/Documents} 5 cd \$DIR || \ 6 { echo "Failed to change working directory to \$DIR" >&2; exit 1; } 7 if [ -f \$CONFIG ]; then 8 /usr/sbin/mtree -f \$CONFIG 9 fi 10 /usr/sbin/mtree -c -k time,md5digest,ripemd160digest > \$CONFIG> ```

I added line numbers with nl, you could use cat or other methods.

I hope line 1 is obvious (see wikipedia if it's not).

In line two, we take the variable \$0, which is the path to the script, obtain its basename, and then strip off the extension:

 ```> basename /Users/telliott/script.sh script.sh > echo x.txt | sed 's/\..*\$//' x```

I had an even harder time with lines 3 and 4, but the folks on StackOverflow helped me.

I learned that bash has a man page (of course!). Also that one can search within a man page by entering "/ :-" after you get the page, which initiates a search for ":-". The answer:

\${parameter:-word}
Use Default Values. If parameter is unset or null, the expansion
of word is substituted. Otherwise, the value of parameter is sub-
stituted.

So these two lines

 ```3 CONFIG=\${1:-\$HOME/.\$SCRIPT} 4 DIR=\${2:-\$HOME/Documents} ```

simply give default values for \$CONFIG and \$DIR if we don't supply them as arguments (\$1 and \$2) when invoking the script. If we called the script by ./script.sh from the Desktop, then we'd have \$CONFIG be equal to ~/.script (a hidden file) and \$DIR be equal to ~/Documents.

Lines 5 and 6 change the working directory to ~/Documents as default, if this fails (because the path input as an argument is no good) we write a message to >&2 and exit with 1 signifying failure. >&2 is stderr. This writes to the Terminal for me.

Lines 7 through 9 check whether \$CONFIG exists already, if it does then we call mtree with the -f flag and \$CONFIG argument:

/usr/sbin/mtree -f \$CONFIG

From the mtree man page:

-f file
input.

If this option is specified twice, the two specifications are com-
pared with each other, rather than to the file hierarchy. The
specifications be sorted like output generated using -c. The out-
put format in this case is somewhat remniscent of comm(1), having
"in first spec only", "in second spec only", and "different" col-
umns, prefixed by zero, one and two TAB characters respectively.
Each entry in the "different" column occupies two lines, one from
each specification.

But the first time through, \$CONFIG shouldn't exist yet. Instead, we call mtree with the -c and -k flags followed by various arguments, and write the result to \$CONFIG. Let's do it in a simple way, as shown below. From the Desktop directory I input the command as shown, and write the results to x.txt in \$HOME. As you can see, timestamps for all the Desktop files are there:

 ```/usr/sbin/mtree -c -k time > x.txt > cat ~/x.txt # user: telliott_admin # machine: Toms-Mac-mini.local # tree: /Users/telliott_admin/Desktop # date: Mon Apr 23 17:20:21 2012 # . /set type=file . type=dir time=1335216020.0 .DS_Store time=1335216014.0 .localized time=1333333575.0 notes.txt time=1335210868.0 post.txt time=1335216020.0 script.sh time=1335215494.0 script.sh.txt \ time=1335211653.0 todo.txt time=1335204803.0 .. > ```

Now, if I do it with -f, I get:

 ```> /usr/sbin/mtree -f ~/x.txt . changed modification time expected Mon Apr 23 17:20:20 2012 found Mon Apr 23 17:20:51 2012 .DS_Store changed modification time expected Mon Apr 23 17:20:14 2012 found Mon Apr 23 17:20:55 2012 post.txt changed modification time expected Mon Apr 23 17:20:20 2012 found Mon Apr 23 17:20:51 2012```

I see a change because I'm writing this text into post.txt! For some reason .DS_Store also changed. Next time I'll try to do something a little more useful.

### Channeling Lincoln

I found this on reddit. The post says to consider the first part of the Gettysburg address:

Four score and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal
How long do you think it would take you to manually break the address up into 13 character lines, breaking those lines at appropriate spaces?
I'm not so concerned with the estimation process. This is a fun problem for a beginning Python coder. My solution is below, but please try it yourself first. And yes, I was pleased with my time, but I've done this before. How to handle the beginning and the end is the key. I just make them special cases.

[ UPDATE: You could also just look for the right function in the standard library! ]

 ```> python script.py 10 Four score 9 and seven 13 years ago our 7 fathers 13 brought forth 9 upon this 11 continent a 11 new nation, 12 conceived in 11 liberty and 12 dedicated to 3 the 11 proposition 12 that all men 11 are created 5 equal```

 ```FH = open('data.txt','r') data = FH.read().strip().split() FH.close() pL = list() N = 13 for word in data: assert len(word) <= N tL = [data.pop(0)] for word in data: n = len(' '.join(tL)) n += 1 + len(word) if n > N: pL.append(' '.join(tL)) tL = [word] else: tL.append(word) pL.append(word) for line in pL: print "%2d" % len(line), line ```

### Threats on OS X

It is said that if you know your enemies and know yourself, you will not be imperiled in a hundred battles; if you do not know your enemies but do know yourself, you will win one and lose one; if you do not know your enemies nor yourself, you will be imperiled in every single battle.
--Sun Tzu

Like every other Mac user who is paying attention this week, I've been reading the stories about malware on OS X (e.g. Ars Technica about Flashback here and here).

Charles Edge's Enterprise Mac Security is a good place to start learning about this stuff (although the book's pub date is 2010 and in places it looks quite a bit older). It's also probably not for the average user.

Some terminology: malware is a comprehensive term for various strategies by which bad guys attack users' machines, including (quoting Edge):

Virus: code attached to a file, which can replicate itself and spread to other files on a computer. A simple virus does not spread over a network without transfer of infected files. [Wikipedia has a somewhat different take.]

Worms: spread across networks by taking advantage of security flaws. [An example is OSX.Leap.A.]

Trojan Horse: these embed themselves in an application and activate only when the app is opened. [The app itself is bad, it disguises its actions from the user.]

Rootkit: a software program written to control an operating system.

I've been running OS X since the beginning (when we left System 9 behind), and I've never had anti-virus software. But recent events made me start thinking about being more proactive.

I download and install a lot of non-Apple, non-App Store software, much of it for bioinformatics. I'm well aware that if I type in my admin password for some evil app, there isn't much protection. Yet I run in admin mode all the time, because I find it a pain to be on a plain account. That's the first thing to change if you're not in the same situation.

The most important aspect of my security policy is that I don't have anything on the computer I can't afford to have stolen. I don't have any banking passwords saved. I could be undone by a Keylogger, but that's about it.

My Sharing Prefs panel consists entirely of unchecked boxes. If you want to share stuff you should definitely read Edge's book or another.

I always do a clean install when I upgrade the OS. I have most data backed up to two different hard drives. I pay attention and always update ASAP.

For my desktop, I have automatic login enabled. If someone has physical access to your machine, you're pretty much hosed anyway. (Unless you want to do FileVault).

I've never been crazy about anti-virus software:

• the market share argument is MS propaganda and FUD---OS X with no sharing is hard to crack
• most (all?) Pwn2Own stuff is user account-only, not root (not positive about this, how hard is privilege escalation?)
• anti-virus never protects against zero-day exploits
• the AV guys talk a lot to drum up business, like ambulance-chasing, bottom-feeding blood-sucking lawyers

Nevertheless, the news that the Java exploit somehow got around the requirement for an admin password really worries me. It's not easy to find out how it works (I suppose they don't want copycats). So I'm happy that I don't have Java on my machines running Lion, and I disabled it on the old one.

I've taken two new steps toward heightened security. I downloaded and am evaluating Little Snitch.

So far, I haven't seen any evidence of network connections that wouldn't be expected. However, I was a bit surprised to see that many apps (including all Apple apps) phone home each time they are launched (origin of the phrase).

Also, there are a lot of things (Agents and daemons) doing network stuff that I wasn't aware of---as one example, PubSubAgent.

This agent checks for updates to your RSS feeds, of which I have none (too old fashioned). At some point recently, a Safari update changed the Pref for this to "automatically update articles in Bookmarks bar". It bothered me because I've been methodical about deleting cookies and website data when I'm done. Yet PubSubAgent wanted to connect to feeds.arstechnica.com. I thought, how would they know about that? before understanding that it's a Safari RSS thing. Anyway, I turned off the reader as a test, and we'll see if PubSubAgent doesn't disappear.

I heard about Lingon in Edge's book. It's an app for managing LaunchAgents and (I thought from the book) LaunchDaemons.

Lingon is from Peter Borg, who made (still makes) Smultron. I used Smultron a lot in the days before TextMate, and loved it. So I downloaded the app from the AppStore, but I'm disappointed because it's been crippled. It only works with current user-specific processes and won't show me anything about system stuff.

One thing it did show me is an automatic Adobe Reader update job. I only installed Reader (300 MB!!) for one time when I had to use it. That job is now deleted.

The second new tool that I'm working with is ClamAV, an open source anti-virus tool. I read about it here and got it (with a Mac GUI) from here.

Apparently ClamAV has sucked for OS X in the past, but lately things have changed for the better. Anyway, the scan didn't find anything on my account's files.

What else should I worry about?

My Wi-Fi network is not "closed", the name is visible when you scan for networks. This could be a security liability. I don't know if it's possible to do a brute-force password attack on the Airport---something to read up on. But at least I have a decent password. Luckily, I live in a place where very few people drive by.

My ip address from the isp is supposed to be dynamic, but it's not---hasn't changed forever, as far as I know. Apparently one can force a change by spoofing a new MAC address for the router, but I don't want to do that. Maybe I'll go rummage in the basement for an old router and see what happens.

Rootkits certainly do exist for OS X. There have been ways to look for them (by scanning hashes of system files for alterations) like this Linux tool.

I believe that Code Signing should make standard rootkits obsolete. But maybe not, since this is not for "scripts" and such but full-fledged applications. Perhaps a question to apple.stackexchange is in order.

Finally, I think the advantage has to lie with the offense, in a situation where they cannot themselves be attacked. Structure your defense by asking yourself first: what's the worst that could happen if I did get hacked? And take steps to ameliorate that.

One last thing. The future of OS X is clearly not good for people like me: my computer will be like my iPhone. Apps can do whatever they want on the network, there's so much network activity (and push stuff), that the user can't make sense of it, but in theory an app will be so completely sandboxed it can't touch the rest of the machine. My programs won't be able to just do whatever they want anymore. Like the impact of barbed-wire fencing on the old West.

Sorry for the rambling post..

• ## Sunday, April 22, 2012

### PyObjC Templates for Xcode are back!

I came across a page from a guy who went to the trouble of making and posting new templates for using PyObjC with Xcode 4.

Thanks!

They are on github here. I just followed the instructions:

Copy the File Templates and Project Templates folders to the following path in your home directory, creating any missing intermediate directories if needed:

~/Library/Developer/Xcode/Templates/

I'm a little rusty with Xcode (and it's become more and more like flying the space shuttle or something. Still, I was able to recycle this old project, while changing it just a bit.

One funny thing, I couldn't figure out how to hook up the text field outlet in the old way. I had to use bindings, and of course I did so for the popup as well (both Content and Selected Index).

I don't have a lot of time to fool with Xcode any more, but I do want to check out Greg's blog. Thanks again.

Here's the AppDelegate:

 ```#-*- coding: utf-8 -*- from Foundation import * from AppKit import * class SpeakerAppDelegate(NSObject): TF = objc.ivar('TF') voiceL = objc.ivar('voiceL') nameL = objc.ivar('nameL') selectedIndex = objc.ivar('selectedIndex') notSpeaking = objc.ivar('notSpeaking') pretext = 'Hi, my name is ' def init(self): s = NSSpeechSynthesizer.alloc().initWithVoice_(None) self.speaker = s self.speaker.setDelegate_(self) self.setVoices() self.notSpeaking = True self.setSelectedIndex_(0) return self def setTF_(self, value): if value != self.TF: self.TF = value def setSelectedIndex_(self, value): if value != self.selectedIndex: self.selectedIndex = value name = self.nameL[self.selectedIndex] self.setTF_(self.pretext + name) def setVoices(self): L = NSSpeechSynthesizer.availableVoices() self.voiceL = L DL = [NSSpeechSynthesizer.attributesForVoice_(v) for v in L] nL = [D.objectForKey_('VoiceName') for D in DL] self.nameL = NSMutableArray.arrayWithArray_(nL) NSLog("%s" % self.nameL) @objc.IBAction def speak_(self,sender): i = self.selectedIndex NSLog("%i Say: %s" % (i, self.TF)) self.speaker.setVoice_(self.voiceL[i]) self.notSpeaking = False self.speaker.startSpeakingString_(self.TF) pass def speechSynthesizer_didFinishSpeaking_(self,speechSyn,flag): NSLog("didFinishSpeaking_") self.notSpeaking = flag ```

## Saturday, April 21, 2012

### Damn you, autocorrect

Apparently, I'm not the only one who has problems with autocorrect.

At some point (Lion?), Safari acquired this ability. I keep seeing weird errors in my posts that I have no idea where they came from. So I turned it off system-wide under Language and Text Prefs. I also unchecked the box (which is checked in the screenshot) for replacing (c) with the symbol ©, etc.

### Testing

It seems improbable, but it is a fact that on the first exam in some courses, a few students may score lower than expected for guessing at random. We usually see some scores of less than 20 out of 100 for an exam whose questions have five possible answers ABCDE.

We can turn this into an illustration of statistical significance. How high must a score be before one can dismiss the null hypothesis (no knowledge of the subject) with a p-value of 0.05?

We have a Bernouli trial with p = 0.2.

The expected value or mean is np. There is a beautifully simple derivation of the mean and variance in the wikipedia article, the variance is np(1-p). For our example, the mean is 20 and the variance is 100 x 0.2 x 0.8, so the standard deviation is √16 = 4.

For a large class (N = 100) of randomly answering students, the distribution appraches normal, and we expect that 2.5% of such students will score at or above two standard deviations above the mean. To make our rate for a type I error less than 0.05, we should require a score of 28 or higher.

I wrote a short script to simulate this. Each student takes a test with 100 questions, and each trial consists of a class of 100 students. We count the number of students with scores at or above a threshold T. The results for 10 trials are printed, and the results for 100 trials are averaged.

A few quick points: the mean appears to fall between 20 and 21. I'm sure there's a simple explanation but it eludes me at the moment. The score that 2.5% of students regularly exceed is 28.

This is a great example of the danger of multiple hypothesis testing!

One last thing is that the code is not very smart. What I should have done was extract the statistics for each value of T for each trial, rather than repeating the trial for each value of T. But it runs fast enough, and this way I don't need a data structure to remember the results.

 ```> python script.py 20: 56 43 54 57 57 54 57 46 45 64 55.0 21: 39 48 42 44 39 44 48 48 50 48 43.5 22: 43 32 39 30 38 35 36 42 33 41 34.6 23: 40 29 35 31 27 19 30 20 31 25 26.0 24: 19 18 24 26 20 22 21 17 15 24 19.2 25: 11 14 8 8 13 11 12 15 17 13 12.9 26: 12 6 6 6 11 4 8 7 10 6 9.2 27: 7 4 6 5 8 7 5 7 7 7 5.9 28: 3 2 4 2 3 1 2 3 5 1 3.6 29: 4 4 2 2 2 1 3 1 0 1 2.2 30: 1 1 1 0 1 1 1 0 0 1 0.9 31: 4 0 1 0 0 0 0 0 0 0 0.6 32: 0 0 0 0 1 0 0 1 0 0 0.3 33: 0 0 0 0 0 1 0 0 0 1 0.1 34: 0 0 0 0 0 0 0 0 1 0 0.1 ```

script.py
 ```import random def student(N=100, p=0.2): L = [random.random() <= p for i in range(N)] return sum(L) def trial(T): L = [student() for i in range(100)] return sum([1 for n in L if n >= T]) def mean(L): return 1.0*sum(L)/len(L) for T in range(20,35): L = list() print '%2d: ' % T, for i in range(100): L.append(trial(T)) if i < 10: print '%2d' % L[-1], print '%4.1f' % mean(L) ```

## Friday, April 20, 2012

### Hex puzzle [-1]

Here is a fourth (and final) post in the saga of the "Let us have hex" puzzle. (Actual puzzle, reddit comments, and previous posts here, here and here).

Although I know the redditeers have already solved the problem, I had hoped to work my way through it more systematically. But I ran into a constraint I can't get around very easily (or perhaps at all), which is that the available data is not complete. It's not enough to yield a systematic solution.

An important feature of the puzzle is that visitors to the puzzle page get served 100 digit hex strings that are refreshed to a new value slowly (maybe every half hour), and also might (at least theoretically) be restricted by ip address, so that a social venue like reddit is critical to gathering all the data. But the reddit faction quit too soon, after they'd guessed the answer.

The total number of strings I found in the reddit comments is 340. Let me show two demonstrations that the complete data is not there. First, recall that the base64 encodings have "codons" of length 4. So two adjacent 50-character base64 strings that are not correctly aligned (because they abut without overlap, and were decoded individually), might be missing as many as 4 base64 characters at the joint. This would affect in turn at most 2 adjacent characters in the C or PHP program, since the ratio is 4:3.

A simple decoding of string i = 6 gives:

< 73; i++) { pri
{ echo \$a[\$i];

The first line is C and the second PHP. The problem is that there's nothing in the known data that corresponds to what likely comes next---there's no "echo \$b[\$i];" in the data. It's just missing. I suppose there could be a series of very short overlaps that would build it, but it seems pretty unsporting. Similarly, the formatting of the C output is not there.

More conclusive evidence comes from an analysis of the solution. The first word is "Design" and it is derived from the C program, which has this fragment:

int a[] = {4, 4, 6, 5, 7, 3, 6, 9, 6, 7..

If we combine these as hex digits and then print the ASCII equivalent we get:

 ```>>> L = ['44','65','73','69','67'] >>> L = ['0x' + s for s in L] >>> L = [chr(int(s,base=16)) for s in L] >>> L ['D', 'e', 's', 'i', 'g']```

We need the "n" in design which is '0x6e' in hex:

 ```>>> hex(ord('n')) '0x6e'```

But there is no 'e' in the data with a leading 6. The overlap is not there.

I think I'm done now. I had fun. I wrote a lot of code to try to deal with the challenge of how to assemble a bunch of strings where the candidate strings with a large overlap might be misleading. I was prepared to do the thing by hand, as shown in the screenshot. But there is not enough data, you have to guess. Here is what I was reduced to:

## Sunday, April 15, 2012

### Hex puzzle (3)

I've been working on the hex puzzle (previous posts here and here). We have a list of 340 unqiue lines of data, each 100 characters in length. The first and last lines have been given to us as i = 137 and i = 61, respectively.

The data file ('reddit_hex.txt') is on Dropbox here.

The data seem to be hexadecimal bytes with an unusual distribution that leads to the first part of the decoding: these are bytes that represent ASCII characters in the set A..Za..z0..1. This suggests the possibility that the given data are a hex representation of base64 (although the last two characters '/' and '+' were not found).

Unfortunately, I haven't yet been able to stitch the lines together in a way that generates a 1936 hex digit mesage, with the correct ends, to be decoded. Apparently I go wrong at some points in the assembly, probably because of repeats in the data stream, although perhaps I just messed up.

It might be a disappointment to you, but I'm going to try using another critical clue gleaned from the comments to the reddit post. After converting the base64 data to actual ASCII characters, it turns out that there are two intertwined messages: the alternating bytes form a C program and a PHP program. So I wrote a script that uses all these clues to analyze each of the lines of the original data.

Let's look at some sample output, say the first entry (i = 137):

 ```> python demo.py 137 137 497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369 I z x p P 2 5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci 0 ('#include ', '

What's going on here is that we have 100 hexadecimal digits or 50 bytes in the original data. Each byte can be converted to its ASCII representation as shown in the second and third lines.

 ```>>> chr(int('0x49',base=16)) 'I'```

We can do the whole line at once with:

 ```>>> import binascii >>> s = '497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369' >>> binascii.unhexlify(s) 'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci'```

The base64 data decoding takes 4 characters at a time. For example:

 ```>>> import base64 >>> d = base64.b64decode >>> d('Izxp') '#>> s = 'IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci' >>> d(s[:-2]) '#9'```

Because base64 is a 4-letter code, there are 4 possible frames, and we only know the frame for the first line of data. It turns out that for this data, for each line, only one of the frames yields just ASCII characters. In the script below, we filter out decodings like this one:

'\xd0\xb0\x80\xc8\xb1\x80\xc0\xb0\x80\xd8\xb0\x80\xc4\xb1\x80\xd8\xb0\x81'

by turning the whole line into an array of ints and then asking whether any is > 128.

 ```def filter(line): B = [struct.unpack('B',b)[0] for b in line] return not max(B) > 128```

After conversion from base64 to ASCII, we build strings from the alternate bytes, resulting in the output line:

0 ('#include <stdio.h>', '<?php \$a = array(9')

This indicates that the two strings were derived from frame 0. It's obvious that the first one is the beginning of a C program, and the second certainly looks like PHP to me.

Note that we can't just use the data as is, because we have thrown away bytes at the joints. Decoding in frame 0, we lose the 'Ci' at the end of the base64 data.

Also, consider i = 8:

 ```> python demo.py 8 8 357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369787049473432 5 w Y 2 h s c H U g Z C R l Y S A g P D 1 z I H R h Z H J p c m 9 h L n l o K D 4 5 C i x p I G 4 2 8 5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42 2 ('clude \nin', 'hp \$a = array(9, 6')```

Notice that 'clude <stdio.h>\nin' looks like a continuation of '#include <stdio.h>' from the first line.

 ```>>> import base64 >>> d = base64.b64decode >>> d('Y2hscHUg') 'chlpu '```

The 'cl' of 'clude' are coming from the base64 'Y2hs'. It's reassuring that we can align 0 and 8:

 ```IzxpP25wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45Ci 5wY2hscHUgZCRlYSAgPD1zIHRhZHJpcm9hLnloKD45CixpIG42```

So the plan is, to stitch data lines together one at a time, and examine the result to be sure it makes sense after going through base64 decoding and de-interleaving.

demo.py
 ```import sys, binascii, base64, struct def load_data(fn): FH = open(fn,'r') data = FH.read().strip() FH.close() return data.strip().split('\n') def untangle(s): s1 = ''.join([s[i] for i in range(0,len(s),2)]) s2 = ''.join([s[i] for i in range(1,len(s),2)]) return s1,s2 def filter(line): B = [struct.unpack('B',b)[0] for b in line] return not max(B) > 128 def process(s): d = base64.b64decode rL = list() for i,j in [(0,-2),(1,-1),(2,len(s)),(3,-3)]: line = d(s[i:j]) if filter(line): rL.append((i,untangle(line))) return rL def analyze_line(i,L): D = {'i':i} s = L[i] D['hex'] = s s = binascii.unhexlify(s) D['b64'] = s result = process(s) assert len(result) == 1 j, t = result[0] D['frame'] = j D['t'] = t return D def format_dict(D): s1 = '%3d ' % D['i'] + D['hex'] s2 = ' ' + ' '.join(list(D['b64'])) s3 = '%3d ' % D['i'] + D['b64'] s4 = '%3d ' % D['frame'] + str(D['t']) return '\n'.join([s1,s2,s3,s4]) L = load_data('reddit_hex.txt') DL = [analyze_line(i,L) for i in range(len(L))] try: i = int(sys.argv[1]) text = format_dict(DL[i]) print text, '\n' except IndexError: for D in DL: text = format_dict(D) print text, '\n' ```

## Tuesday, April 10, 2012

### Hex puzzle (2)

Let's continue with the puzzle introduced in the last post (here).

The next step is to stitch the lines together to form the complete text. If they contained random bytes, that would be relatively easy. There are 65536 possible two-byte combinations. We have a total of 340 * 25 = 8500 two-byte (4 hex character) combinations in our data, so the expected number of times any combination should appear in random data is much less than 1.

However, the data is not random, instead, it is highly skewed. Simple modifications to count.py (from last time) show a total of only 370 two-byte combinations in the data (i.e. 65166 of all possible values are not observed), and the top 20 have counts ranging from 464 down to 72.

 ```464 4341 CA 331 4943 IC 309 4167 Ag 281 4c43 LC 234 734c sL 174 4173 As 154 4377 Cw 140 7773 ws 139 7349 sI 139 674d gM 139 674c gL 137 4c44 LD 126 674e gN 104 6749 gI 103 7767 wg 95 4e69 Ni 89 5973 Ys 78 4944 ID 76 4947 IG 72 4e6a Nj ```

In the table above, the first column is the count, the second and third are the hex and ASCII representations.

That compares with a maximum count of 3-4 for random data, and an observed total of 8000 different values on an average run with random data.

This suggests that we should undertake some data exploration to see whether the repetitive character of the data will cause difficulties for the assembly.

The core operation is to take an individual entry at index i in the data (the query), and depending on a parameter (tlen) for the number of characters to choose at the very end of the query, make a substring sequence t, the target to search for. Then, we go through all the other entries looking for a match containing t, and when one is found we save the index of the query and the match as well as the position k where the target t begins in the match string.

The quality of these matches can be assessed in various ways. We expect that the sequence of each match string upstream of the target should be exactly the same as the query. We calculate the offset o where the beginning of the match string should be found in the query. Since this depends on tlen (o = len(L[i]) - k - n) we calculate it for each hit and save it with the other parameters. The offset is also useful for printing the actual alignments in show.

A second quality test is to examine the match strings downstream of the target. In the region of overlap, we expect they should be identical as well. A failure of either test would suggest that the data is causing trouble for our simple search algorithm.

As mentioned last time, the observed hex digits comprise all (and only) those which encode ASCII characters 0..1A..Za..z. Rather than print out strings of 100 characters, it will simplify the output to convert the data to ASCII first. We're assuming that this is what we need to do with the data, but I peeked at the answer on reddit, so it's safe to go ahead.

python convert.py > data.mod.txt
 ```import binascii from utils import load_data L = load_data('reddit.txt') for line in L: print binascii.unhexlify(line)```

Note the extremely repetitive nature of the data by comparing some adjacent output lines:

 ```A1NiAsNiAsMyAsMSA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx A1NywsICAyNSwsICA0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAx AgLDUgLDcgLDMgLDAgLDAgLDcgLDIgLDIgLDYgLDYgLDEgNTQs AgLDUgLDcgLDMgLDAgLDAgLDcgLDYgLDIgLDYgLDYgLDEgNTQs xpIG42dCwgIG00YSxpIG42KCwpICAzeywgICAyICwgICA2ICxp xpIG43dCwgIGE3WyxdICA3PSwgIHswNCwsICA2NCwsICAxNiws```

However, for a well-chosen query (one with a rare two-byte combination as the target) and a long enough target length, it's easy to find matches that appear correct. In the output below, we examine the string found at index 0 in the data as the query. The analysis finds 25 matches, data strings where the terminal 4 characters appear somewhere in the match, and all of those matches are perfect ones whether looking upstream and comparing against the query, or looking downstream in each one (beyond the target) and comparing with each other, as shown in the second half of the output.

We'll exercise these routines a bit more next time. I want to see how big to make the target size, and also whether we should exclude some strings from the preliminary analysis. But overall, it looks like we might be OK.

 ```> python search.py i= 0 : 25 matches found 0NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc GExc NiwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc iwsICAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CAyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc AyJyxmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc xmICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc mICcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CcwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc cwLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc wLCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc LCwgIDc2KSw7ICAxJCxiICA2PSwgIGExc CwgIDc2KSw7ICAxJCxiICA2PSwgIGExc wgIDc2KSw7ICAxJCxiICA2PSwgIGExc gIDc2KSw7ICAxJCxiICA2PSwgIGExc CAxJCxiICA2PSwgIGExc AxJCxiICA2PSwgIGExc xJCxiICA2PSwgIGExc JCxiICA2PSwgIGExc CxiICA2PSwgIGExc xiICA2PSwgIGExc A2PSwgIGExc 2PSwgIGExc SwgIGExc gIGExc IGExc GExc no mismatches upstream GExc j jR jRyLGE jRyLGEg jRyLGEgeTYo jRyLGEgeTYoL jRyLGEgeTYoLDc jRyLGEgeTYoLDcg jRyLGEgeTYoLDcgL jRyLGEgeTYoLDcgLD jRyLGEgeTYoLDcgLDQ jRyLGEgeTYoLDcgLDQg jRyLGEgeTYoLDcgLDQgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDY jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLD jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDA jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUg jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDc jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgL jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLD jRyLGEgeTYoLDcgLDQgLDkgLDIgLDYgLDAgLDUgLDcgLDM no mismatches downstream```

search.py
 ```from utils import load_data # tlen is minimum overlap # functions deal with one string at a time # do not assume upstream matches # e is search string, f is potential match def one_search(i,L,tlen): # search result has associated tlen oL = list() e = L[i] t = e[-tlen:] for j,f in enumerate(L): if i == j or not t in f: continue # offset is pos in e where f should start # not guaranteed to match except over t offset = len(e) - f.index(t) - tlen oL.append((j,offset)) oL.sort(key=lambda item:item[1]) return oL # mmL is a list of mismatched strings # ummL contains upstream mismatches def test_upstream(i,L,oL,ummL): e = L[i] for j,o in oL: f = L[j] if not f.startswith(e[o:]): if i < j: ummL.append((i,j)) else: ummL.append((j,i)) def trim_downstream(oL,L,tlen): pL = list() for j,o in oL: f = L[j] k = len(f) - o pL.append((j,f[k:])) return pL def test_downstream(pL,dmmL): mismatches = 0 for i,s1 in pL: for j, s2 in pL: if s1 == s2 or len(s1) < len(s2): continue if not s1.startswith(s2): if i < j: dmmL.append((i,j)) else: dmmL.append((j,i)) if __name__ == '__main__': L = load_data('data.mod.txt') tlen = 4 i = 0 e = L[i] oL = one_search(i,L,tlen) print 'i=', i, ':', len(oL), 'matches found' print e print ' ' * (len(e)-tlen) + e[-tlen:] print for j,o in oL: k = len(e) - o print ' '*o + L[j][:k] ummL = list() test_upstream(i,L,oL,ummL) if ummL: print len(ummL), 'mismatches' else: print 'no mismatches upstream\n' dmmL = list() print e[-tlen:] pL = trim_downstream(oL,L,tlen) for item in pL: print ' '*tlen + item[1] test_downstream(pL,dmmL) if dmmL: print len(dmmL), 'mismatches' else: print 'no mismatches downstream' ```

## Monday, April 9, 2012

### Hex puzzle (1)

Last Saturday, an interesting puzzle was posted on reddit/r/programming. I'm having trouble finding the original post, but the actual puzzle page is here, and the (complete) comments to the post are here.

As you can see from the screenshot or the link

the puzzle is a message consisting of 1936 hex digits, and upon further manipulation it promises to yield the answer. You can actually follow the link to see the answer if you're impatient. I wasn't able to compete (I had a party to cook for, and could not have gone as fast as these guys anyway), but I'd still like to understand how they got there.

Apparently, we need to assemble a number of overlapping 100 digit pieces of a 1936 digit message and then figure out what it means. The pieces were served up to different visitors to the site and collected as comments on reddit. The first objective is to recover the observed data. I believe reddit must have an API for this, but I don't know it. What I did was to save the source for the page with all 410 comments (in the file comments.txt) and then run scrape.py below.

I count 251 strings of 100 characters, all in "0..9a..f".

According to the instructions, the first and last strings are marked, and these were recovered and posted on reddit:

497a78705032357759326873634855675a43526c595341675044317a494852685a484a70636d39684c6e6c6f4b4434354369 //first 100
4270494630674f3349675a5831304948566c636d4e75614342764d4341374a79426c494363674f7941674944393950676f3d //last 100

[UPDATE: There is a link to the complete (known) data from reddit to github here. I count a total of 340 strings, substantially more than what I've been working with in early attempts. This should help with the assembly problem. ]

The data clearly has substantial structure, as indicated by the screenshot from a "find" search for the digit "4" in TextEdit.

Since only hexadecimal digits are present it suggests that this might actually be hexadecimal. It's interesting that not all digits are equally represented. In particular, only those bytes (two hex digits) which encode ASCII "a..zA..Z0..9" are present (although the distribution is decidedly odd). So it seems pretty likely that we just need to decode the hex to ASCII and then figure out what it means.

However, there is an assembly problem that comes first.

Output from count.py (listing at the end):

 ```> python count.py 54 30 48 0 0 40 31 49 1 1 256 32 50 2 2 71 33 51 3 3 58 34 52 4 4 82 35 53 5 5 64 37 55 7 7 39 38 56 8 8 63 39 57 9 9 1070 41 65 A A 133 42 66 B B 1102 43 67 C C 414 44 68 D D 111 45 69 E E 38 46 70 F F 169 47 71 G G 101 48 72 H H 812 49 73 I I 151 4a 74 J J 132 4b 75 K K 667 4c 76 L L 399 4d 77 M M 532 4e 78 N N 83 4f 79 O O 82 50 80 P P 67 51 81 Q Q 65 52 82 R R 286 53 83 S S 58 54 84 T T 51 55 85 U U 12 56 86 V V 56 57 87 W W 31 58 88 X X 315 59 89 Y Y 92 5a 90 Z Z 16 61 97 a a 13 62 98 b b 278 63 99 c c 61 64 100 d d 58 65 101 e e 8 66 102 f f 938 67 103 g g 57 68 104 h h 247 69 105 i i 201 6a 106 j j 116 6b 107 k k 62 6c 108 l l 64 6d 109 m m 97 6e 110 n n 61 6f 111 o o 209 70 112 p p 896 73 115 s s 25 74 116 t t 23 75 117 u u 15 76 118 v v 532 77 119 w w 211 78 120 x x 414 79 121 y y 192 7a 122 z z```

scrape.py
 ```# python scrape.py > data.txt fn = 'comments.txt' FH = open(fn) data = FH.read() FH.close() L = list() i = 0 while True: j = data.find('>',i) j += 1 i=j k = j + 100 if k > len(data): break if not data[k] == '<': continue w = data[j:k] if ' ' in w or '>' in w or '<' in w: continue for c in w: assert c in '0123456789abcdef' L.append(w) if i == 0: break L = list(set(L)) L.sort() for w in L: print w```
count.py
 ```import binascii from collections import Counter def load_data(): fn = 'data.txt' FH = open(fn,'r') data = FH.read().strip() FH.close() L = data.split('\n') return L L = load_data() n = 2 cL = list() for line in L: for i in range(0,len(line)-n+1,2): cL.append(line[i:i+n]) C = Counter(cL) for h,n in sorted(C.most_common()): i = int(h, base=16) print '%4d %s %4d %s %s' %\ (n, h, i, chr(i), binascii.unhexlify(h))```

## Saturday, April 7, 2012

A collection of links to tutorials, blog posts and software that look interesting, which I hope to explore soon:

• Wikibooks: Reverse Engineering/Mac OS X
• Reverse Engineering Mac OS X
• Learn Python The Hard Way
• Learn C The Hard Way
• Apple docs on shell scripting