Friday, November 5, 2010

Student's t-test again 2


This is the second in a series of posts about Student's t test. The previous post is here. If you want to do a t-test for real data, I would urge you to use either R or SciPy.

R


R is great software. It was written by statisticians for statistical work and is thoroughly tested. My only problem is that I find it very difficult to write R code. R was not designed for text or character manipulation, where Python excels. That said, I've been very impressed with how well the R GUI works on Mac OS X.

Note: I posted about the t-test in R before (here, here and here).


SciPy


SciPy is a library that provides all kinds of goodies useful for scientific applications. It has statistical functions too, and lots of 'em, e.g. the two-sample t-test. If your goal is to use Python for statistical programming, that is probably where you should go. I posted about my difficulties installing SciPy: here, here and here. But I finally got it to work, even on my office computer, described briefly here. Resist the urge to "roll your own."
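For instance, the two-sample test in SciPy is stats.ttest_ind. A minimal sketch (the data here are invented for illustration):

```python
# Student's two-sample t-test with SciPy.
# ttest_ind assumes equal variances by default (the classic Student test).
from scipy import stats

a = [52.1, 48.3, 50.7, 49.9, 51.4]
b = [47.2, 46.8, 48.9, 47.5, 46.1]

t, p = stats.ttest_ind(a, b)
print('t = %.4f, p = %.4f' % (t, p))
```

With these made-up samples the means differ by about 3 units, so the test comes back significant at the usual 0.05 level.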

You could also try Sage. It should be much easier to install than SciPy (a binary with everything but the kitchen sink). However, I could only find elementary statistics in a quick search.


PyCogent


Our goal here is to understand how the tests work. It's not complicated. In order to do this I started with the statistical functions included as part of PyCogent. I have a bunch of posts about PyCogent here, including one on the two-sample t-test in Python here.

If you don't want to install the whole PyCogent package (or you just want to take the modules apart like I did), you can download the source and modify three modules: test.py, distribution.py and special.py. You can find them in /PyCogent-1.4.1/cogent/maths/stats. I copied them onto my desktop, then stripped out everything I could while still allowing my test of t_two_sample to run.

Let's look at the names defined in the stripped-down modules:

>>> def show(module):
...     for n in dir(module):
...         if not '_' == n[0]:
...             print n
... 
>>> import test
>>> show(test)
array
asarray
division
isinf
isnan
mean
sqrt
sum
t_high
t_low
t_one_observation
t_one_sample
t_paired
t_tailed_prob
t_two_sample
tprob
var
>>> import distribution
>>> show(distribution)
MACHEP
PI
atan
betai
division
exp
sqrt
stdtr
stdtri
t_high
t_low
tprob
>>> import special
>>> show(special)
GP
GQ
Gamma
Gamma_small
MACHEP
MAXGAM
MAXLOG
MINLOG
betai
betai_result
division
exp
floor
log
polevl
pseries
sin
sqrt


The modules are hierarchical. At the top level (test.py) are the functions we would call from our scripts, including t_two_sample. These call down into distribution.py to stdtr, the cumulative distribution function of Student's t distribution. That in turn is computed using the incomplete beta function (betai) and the Gamma function in special.py.
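The same call chain can be sketched without PyCogent. The standard identity is that the two-tailed p-value for a t statistic is the regularized incomplete beta function evaluated at x = df/(df + t²); here I use SciPy's betainc in place of PyCogent's betai, and cross-check against SciPy's own t distribution:

```python
# Two-tailed p-value for a t statistic via the regularized incomplete
# beta function -- the same route PyCogent's tprob/stdtr/betai chain takes.
from scipy.special import betainc
from scipy.stats import t as t_dist

def two_tailed_p(t, df):
    # P(|T| > t) = I_x(df/2, 1/2) where x = df / (df + t**2)
    x = df / (df + t * t)
    return betainc(df / 2.0, 0.5, x)

df = 8
tval = 2.5
p = two_tailed_p(tval, df)
# cross-check: twice the survival function of the t distribution
print(p, 2 * t_dist.sf(tval, df))
```

The two numbers agree to machine precision, which is reassuring: the incomplete beta route really is just another way of integrating the tails of the t distribution.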

stdtr is a cumulative distribution function. I used a simple trick to compute the pdf from it (take first differences of the cdf, then normalize), and then plotted the pdf for various values of the degrees of freedom (df). This figure is at the top of the post. As you can see, the smaller the df, the "fatter" the tails on the distribution. I made a second plot for the cdf.



script.py

import math
import numpy as np
import matplotlib.pyplot as plt
from distribution import stdtr

PI = math.pi # needed by stdtr
dx = 0.01
X = np.arange(-4,4+dx,dx)
colors = 'rkkkm'
plot_pdf = True

for i,df in enumerate([1,2,5,10,30]):
    cdf = [stdtr(df,x) for x in X]
    pdf = [0]
    for j in range(1,len(cdf)):
        pdf.append(cdf[j] - cdf[j-1])
    Y2 = np.array(pdf)
    Y2 /= sum(Y2)
    if plot_pdf:
        plt.plot(X,Y2,color=colors[i],lw=2)
    else:
        plt.plot(X,cdf,color=colors[i],lw=2)

plt.savefig('example.png')


ttest.py

# two sample t-tests
import numpy as np
import test as stats

def oneTrial(n,f=np.random.normal):
    N = 50 # num of individual tests
    SZ = n*2*N # need this many nums
    draw = f(loc=50,scale=3,size=SZ)
    counter = 0
    for i in range(0,SZ,n*2):
        nums1 = draw[i:i+n]
        nums2 = draw[i+n:i+2*n]
        t, prob = stats.t_two_sample(nums1,nums2)
        if prob < 0.05: counter += 1
    return 1.0*counter / N

n = 3 # sample size
L = list()
for i in range(10):
    #if i and not i % 10: print 'i =', i
    L.append(oneTrial(n))

print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))


output from the second script:


avg 0.0515, std 0.0094 min 0.0300, max 0.0720


The output shows that, on average, the p-value is < 0.05 about 5% of the time, as expected when both samples come from the same population. Compare to the results from here.
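If you have SciPy installed, you can run the same sanity check with stats.ttest_ind instead of PyCogent's t_two_sample (the function names and the seeded generator below are my choices, not from the original script):

```python
# False-positive rate check: draw both samples from the SAME normal
# population, so every "significant" result is a false positive.
import numpy as np
from scipy import stats

def false_positive_rate(n, N=200, seed=0):
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(N):
        a = rng.normal(loc=50, scale=3, size=n)
        b = rng.normal(loc=50, scale=3, size=n)
        t, p = stats.ttest_ind(a, b)
        if p < 0.05:
            count += 1
    return count / N

print(false_positive_rate(3))
```

The result should hover around 0.05, matching the PyCogent-based simulation above.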

Still to come: computing the t distribution, and setting up the tests.
