Sunday, January 31, 2010

Two sample t-test in Python

[UPDATE: I'm currently (November. 2010) in the middle of a series of posts about Student's t test. The first six posts are here, here, here, here, here, and here.]

I'd like to do my teaching in Python as much as possible. In keeping with that, I plan to explore methods for analyzing microarray data using Python rather than R and Bioconductor. As a preliminary matter, I'm going to use Student's t-test (two sample) as implemented in PyCogent. A simple-minded approach would be to a situation where the samples have already been classified, would be to filter them by the t-test statistic.

Here is a simple test. We draw a bunch of numbers from the normal distribution and keep them in a numpy array. We slide a window along the array, pulling out two small sets of n numbers at a time, and do a two sample t-test on them. We record the p-value. We expect that since all the numbers come from the same normal distribution the distribution of the p-value will be such that, on the average, p < 0.05 about 5% of the time.

# two sample t-tests
import numpy as np
import cogent.maths.stats.test as stats

def oneTrial(n,f=np.random.normal):
N = 500 # num of individual tests
SZ = n*2*N # need this many nums
draw = f(loc=50,scale=3,size=SZ)
counter = 0
for i in range(0,SZ,n*2):
nums1 = draw[i:i+n]
nums2 = draw[i+n:i+2*n]
t, prob = stats.t_two_sample(nums1,nums2)
if prob < 0.05: counter += 1
return 1.0*counter / N

Now, we run the trial a bunch of times. For this run, I've chosen the sample size to be 3.

n = 3             # sample size
L = list()
for i in range(100):
#if i and not i % 10: print 'i =', i

print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))


avg 0.0511, std 0.0095 min 0.0280, max 0.0780