Python for Bioinformatics: Student's t-test again 3

Friday, November 5, 2010

Student's t-test again 3

This is the third in a series of posts about Student's t test. The previous posts are here and here.

While looking at the code in PyCogent I noticed a reference to "Cephes." The Cephes Mathematical Library (here)

It contains lots and lots of functions (like the t distribution and the incomplete beta and so on). The analogous functions in PyCogent appear to be just straight translations of the C code.

I found a Python module that makes the Cephes library available directly. It was put together by Michiel de Hoon (here, old home page here), when he was at the University of Tokyo. I've also looked at his clustering software before (here).

According to the last page, he's currently at Columbia University. Just two seconds ago, I saw this, some kind of manual he's written for doing statistics in Python. Whoaah! Looks like it should be very interesting!

Anyhow, I downloaded the source and did

$python setup.py config

ERROR: unknown if big-endian or little-endian
edit the config file before compiling

The config file looks like this

/* config.h file created by setup.py script Fri Nov  5 15:55:52 2010
 * run on darwin */ 
#define UNK 1
#define BIGENDIAN UNKNOWN /* replace with 0 or 1 */

I modified it appropriately and then did as the instructions say:

python setup.py build
python setup.py install

test by:

>>> from transcendental import ndtr 
>>> ndtr(2.) - ndtr(-2.)
0.95449973610364158

I figured the C version should be a lot faster, and it is:

$ python -m timeit -s 'from transcendental import stdtr;  import random' 't = random.random();  stdtr(3,t)'
1000000 loops, best of 3: 1.26 usec per loop

$ python -m timeit -s 'import distribution;  from distribution import stdtr;  import random' 't = random.random();  stdtr(3,t)'
10000 loops, best of 3: 24.2 usec per loop

But ... it doesn't seem to be the slow step in the t test code.

Calculating the cdf and testing the result

I looked up the definitions for the t distribution's pdf and cdf in wikipedia (here).

I translated the x-dependent part into Python and "vectorized" and used it to constuct a pdf. It's important to set really wide limits for the vector in this step, otherwise numerous small probability values are discarded, which then leads to small errors in the eventual product.

Since the normalization constants (gamma functions) depend on df but not on x, I was able to normalize the pdf simply by doing:

pdf /= sum(pdf)

I also used a very small step size in the Numpy array. The second section of the code listing ends by printing values from the distributions for whole numbers between -5 and 5:

pdf
-5.000  0.00422
-4.000  0.00916
-3.000  0.02297
-2.000  0.06751
-1.000  0.20675
0.000  0.36755
1.000  0.20675
2.000  0.06751
3.000  0.02297
4.000  0.00916
5.000  0.00422

calculated cdf
-5.000  0.00770
-4.000  0.01401
-3.000  0.02884
-2.000  0.06970
-1.000  0.19560
0.000  0.50018
1.000  0.80460
2.000  0.93037
3.000  0.97118
4.000  0.98600
5.000  0.99231

As a test, I show that I get very similar values using stdtr from either PyCogent or Cephes:

stdtr from transcendental
-5.000  0.00770
-4.000  0.01400
-3.000  0.02883
-2.000  0.06966
-1.000  0.19550
0.000  0.50000
1.000  0.80450
2.000  0.93034
3.000  0.97117
4.000  0.98600
5.000  0.99230

stdtr from PyCogent
-5.000  0.00770
-4.000  0.01400
-3.000  0.02883
-2.000  0.06966
-1.000  0.19550
0.000  0.50000
1.000  0.80450
2.000  0.93034
3.000  0.97117
4.000  0.98600
5.000  0.99230

And that all matches what I get from R. So I'm pretty confident that I have the distributions correct and the functions are working as they should.

> A = seq(-5,5,by=1)
> pt(A,3)
 [1] 0.007696219 0.014004228 0.028834443 0.069662984
 [5] 0.195501109 0.500000000 0.804498891 0.930337016
 [9] 0.971165557 0.985995772 0.992303781

code listing:

import numpy as np
import distribution    # also imports special
import transcendental

dx = 0.001
X = np.arange(-100,100+dx,dx)

@np.vectorize
def p(df,x):
    base = 1 + x**2/df
    exponent = (df+1)/2
    return base**(-exponent)

df = 3
pdf = p(df,X)

# normalization depends on df but not x
# so this works
pdf /= sum(pdf)
#=============================================

def show(F,multiplier=1):
    N = 95000
    for n in range(N,len(X)-N,1000):
        t = X[n], multiplier*F[n]
        print '%5.3f  %6.5f' % t
    print

# we include a multiplier for the pdf only
print 'pdf'     
show(pdf,multiplier=1000)

cdf = [pdf[0]]
for n in pdf[1:]:
    cdf.append(n + cdf[-1])
print 'calculated cdf'
show(cdf)
#=============================================

@np.vectorize
def f(x):
    return transcendental.stdtr(df,x)

cdf2 = f(X)
print 'stdtr from transcendental'
show(cdf2)

@np.vectorize
def f(x):
    return distribution.stdtr(df,x)

cdf3 = f(X)
print 'stdtr from PyCogent'
show(cdf3)