## Friday, November 5, 2010

### Student's t-test again 3

This is the third in a series of posts about Student's t test. The previous posts are here and here.

While looking at the code in PyCogent I noticed a reference to "Cephes." The Cephes Mathematical Library (here)

It contains lots and lots of functions (like the t distribution and the incomplete beta and so on). The analogous functions in PyCogent appear to be just straight translations of the C code.

I found a Python module that makes the Cephes library available directly. It was put together by Michiel de Hoon (here, old home page here), when he was at the University of Tokyo. I've also looked at his clustering software before (here).

According to the last page, he's currently at Columbia University. Just two seconds ago, I saw this, some kind of manual he's written for doing statistics in Python. Whoaah! Looks like it should be very interesting!

Anyhow, I downloaded the source and did

 `\$python setup.py configERROR: unknown if big-endian or little-endianedit the config file before compiling`

The config file looks like this

`/* config.h file created by setup.py script Fri Nov 5 15:55:52 2010 * run on darwin */ #define UNK 1#define BIGENDIAN UNKNOWN /* replace with 0 or 1 */`

I modified it appropriately and then did as the instructions say:

 `python setup.py buildpython setup.py install`

test by:

 `>>> from transcendental import ndtr >>> ndtr(2.) - ndtr(-2.)0.95449973610364158`

I figured the C version should be a lot faster, and it is:

 `\$ python -m timeit -s 'from transcendental import stdtr; import random' 't = random.random(); stdtr(3,t)'1000000 loops, best of 3: 1.26 usec per loop\$ python -m timeit -s 'import distribution; from distribution import stdtr; import random' 't = random.random(); stdtr(3,t)'10000 loops, best of 3: 24.2 usec per loop`

But ... it doesn't seem to be the slow step in the t test code.

### Calculating the cdf and testing the result

I looked up the definitions for the t distribution's pdf and cdf in wikipedia (here).

I translated the x-dependent part into Python and "vectorized" and used it to constuct a pdf. It's important to set really wide limits for the vector in this step, otherwise numerous small probability values are discarded, which then leads to small errors in the eventual product.

Since the normalization constants (gamma functions) depend on df but not on x, I was able to normalize the pdf simply by doing:

`pdf /= sum(pdf)`

I also used a very small step size in the Numpy array. The second section of the code listing ends by printing values from the distributions for whole numbers between -5 and 5:

 `pdf-5.000 0.00422-4.000 0.00916-3.000 0.02297-2.000 0.06751-1.000 0.206750.000 0.367551.000 0.206752.000 0.067513.000 0.022974.000 0.009165.000 0.00422calculated cdf-5.000 0.00770-4.000 0.01401-3.000 0.02884-2.000 0.06970-1.000 0.195600.000 0.500181.000 0.804602.000 0.930373.000 0.971184.000 0.986005.000 0.99231`

As a test, I show that I get very similar values using `stdtr` from either PyCogent or Cephes:

 `stdtr from transcendental-5.000 0.00770-4.000 0.01400-3.000 0.02883-2.000 0.06966-1.000 0.195500.000 0.500001.000 0.804502.000 0.930343.000 0.971174.000 0.986005.000 0.99230stdtr from PyCogent-5.000 0.00770-4.000 0.01400-3.000 0.02883-2.000 0.06966-1.000 0.195500.000 0.500001.000 0.804502.000 0.930343.000 0.971174.000 0.986005.000 0.99230`

And that all matches what I get from R. So I'm pretty confident that I have the distributions correct and the functions are working as they should.

 `> A = seq(-5,5,by=1)> pt(A,3) [1] 0.007696219 0.014004228 0.028834443 0.069662984 [5] 0.195501109 0.500000000 0.804498891 0.930337016 [9] 0.971165557 0.985995772 0.992303781`

code listing:

 `import numpy as npimport distribution # also imports specialimport transcendentaldx = 0.001X = np.arange(-100,100+dx,dx)@np.vectorizedef p(df,x): base = 1 + x**2/df exponent = (df+1)/2 return base**(-exponent)df = 3pdf = p(df,X)# normalization depends on df but not x# so this workspdf /= sum(pdf)#=============================================def show(F,multiplier=1): N = 95000 for n in range(N,len(X)-N,1000): t = X[n], multiplier*F[n] print '%5.3f %6.5f' % t print# we include a multiplier for the pdf onlyprint 'pdf' show(pdf,multiplier=1000)cdf = [pdf[0]]for n in pdf[1:]: cdf.append(n + cdf[-1])print 'calculated cdf'show(cdf)#=============================================@np.vectorizedef f(x): return transcendental.stdtr(df,x)cdf2 = f(X)print 'stdtr from transcendental'show(cdf2)@np.vectorizedef f(x): return distribution.stdtr(df,x)cdf3 = f(X)print 'stdtr from PyCogent'show(cdf3)`