Monday, November 8, 2010

Code

This is the sixth in a series of posts about Student's t test. The previous posts are here, here, here, here, and here.

In this post I'm going to talk about how I would test my home-grown functions to do the t-test in Python. I'm only going to do this for the one-sample test, because I'm running out of energy. But you could easily extend this to the other functions.

We could check using SciPy if it is installed, and that would be easy. But I'm going to use R, partly because it's the gold standard for statistics, and partly because it's a useful technique that I've used in my real work. While we could use RPy as I mentioned the other day, here I am going to use R to generate the sample data, run the test of interest and then write the results to disk.

What we're going to do is run R in batch mode. If it gives an error, which frequently happens when building such a test, we want to show the error. We also clean up extra files when we're done. The code to do this has been added to `utils.py` and is given below. It should be self-explanatory.

When we run R this way, we feed it a text file with the commands to run. This file is constructed by the function `write_R_code`. It's not pretty, but I hope you get the idea. Perhaps a better method, which I've also used, is to write the code to a text file, then load it into Python, substitute the file names and specific functions etc., then write that to disk as the file to give to R.

In the last step, we load the results that R has given us and evaluate how well our functions do on the same data. As you can see, we match very well as a rule, but have rounding errors at the extremes of the t-distribution. I believe this could be fixed by extending the range over which we built the distribution (see here). I think it looks pretty good.

[UPDATE: Just to make it clear what we did: we do 1000 runs and look at the details for the first five, then test the rest for deviation from the R results by more than ε = 0.0001. ]

Output:

 `run(1626, 0)clean.RCode removed.RCode.Rout removed.RData removed.RHistory not foundtest results:N = 1000R gave: t = -1.858651I get: t = -1.85865011386R gave: t = -4.250644I get: t = -4.25063601226R gave: t = -1.743078I get: t = -1.74307933578R gave: t = 0.1286087I get: t = 0.128612751636R gave: t = -9.36424I get: t = -9.36429217504error with:[ 8.946167 8.941678 8.940817] -1843.026 -1843.23012185error with:[ 8.895525 8.686258 8.761813] -52.60937 -52.6095381366error with:[ 10.54477 10.42989 10.66126] -21.77912 -21.7796812829error with:[ 11.05575 10.89208 10.68675] -10.50883 -10.5090054298error with:[ 10.02618 10.07471 10.16116] -48.45138 -48.4527009573`

`R_test.py`

 `import numpy as npfrom t_test import one_sample_timport utilsdef write_R_code(D): L = [ 'fn <- file("' + D['results_fn'] + '","w")', 'n = ' + str(D['n']), 'mu = ' + str(D['mu']), 'sigma = ' + str(D['sigma']), 'for (i in 1:' + str(D['N']) + ') {', 'A = rnorm(n,mu,sigma)', 'result = ' + D['func'] + str(D['mu_est']) + ')', 'cat(A,file = fn,sep = "\n")', 'cat(result\$statistic,file = fn,sep="\n")', 'cat("\n",file = fn)', '}', 'close(fn)' ] FH = open(D['code_fn'], 'w') FH.write('\n'.join(L)) FH.close() def compare_to_my_result(L): print 'test results:' print 'N = ', len(L) for i,e in enumerate(L): A = np.array(e[:-1]) result = one_sample_t(A,mu=12) if i < 5: print 'R gave: t =', e[-1] print 'I get: t =', result[0] print else: if e[-1] - result[0] > 0.0001: print 'error with:' print A, e[-1], result[0] printif __name__ == '__main__': directory = '/Users/telliott_admin/Desktop/' D = { 'code_fn':directory + '.RCode', 'results_fn':directory + 'results.txt', 'func':'t.test(A,alternative="less",mu=', 'n':3,'mu':10,'sigma':2,'mu_est':12,'N':1000 } write_R_code(D) utils.runR(D['code_fn'],v=True) utils.cleanup_R_files(D['code_fn'],directory,v=True) L = utils.load_data(D['results_fn']) compare_to_my_result(L)`

`utils.py`

 `from __future__ import divisionimport os,subprocessimport numpy as npdef unbiased_var(X): n = len(X) sample_SS = sum(X**2) - sum(X)**2 / n return sample_SS/ (n-1) def unbiased_std(X): return np.sqrt(unbiased_var(X))def runR(code_fn,v=False): if v: print 'run' cmd = 'R CMD BATCH ' + code_fn p = subprocess.Popen(cmd, shell=True) sts = os.waitpid(p.pid, 0) if v: print sts if sts[1] != 0: show_R_error()def show_R_error(): fn = directory + '.Rcode.Rout' FH = open(fn, 'r') msg = FH.read() FH.close() print '\n'.join(msg.split('\n')[-3:]).strip()def cleanup_R_files(code_fn,directory,v=False): if v: print 'clean' L = ['.RCode.Rout','.RData', '.RHistory'] L = [code_fn] + [directory + e for e in L] for fn in L: try: os.remove(fn) if v: print fn.split('/')[-1], 'removed' except OSError: if v: print fn.split('/')[-1], 'not found' printdef load_data(results_fn): def convert(s): L = s.split('\n') return [float(n) for n in L] FH = open(results_fn, 'r') data = FH.read().strip() FH.close() L = data.split('\n\n') rL = [convert(s) for s in L] return rL`