Python for Bioinformatics: Two sample t-test in Python

Sunday, January 31, 2010

Two sample t-test in Python

[UPDATE: I'm currently (November. 2010) in the middle of a series of posts about Student's t test. The first six posts are here, here, here, here, here, and here.]

I'd like to do my teaching in Python as much as possible. In keeping with that, I plan to explore methods for analyzing microarray data using Python rather than R and Bioconductor. As a preliminary matter, I'm going to use Student's t-test (two sample) as implemented in PyCogent. A simple-minded approach would be to a situation where the samples have already been classified, would be to filter them by the t-test statistic.

Here is a simple test. We draw a bunch of numbers from the normal distribution and keep them in a numpy array. We slide a window along the array, pulling out two small sets of n numbers at a time, and do a two sample t-test on them. We record the p-value. We expect that since all the numbers come from the same normal distribution the distribution of the p-value will be such that, on the average, p < 0.05 about 5% of the time.

# two sample t-tests
import numpy as np
import cogent.maths.stats.test as stats

def oneTrial(n,f=np.random.normal):
    N = 500      # num of individual tests
    SZ = n*2*N    # need this many nums
    draw = f(loc=50,scale=3,size=SZ)
    counter = 0
    for i in range(0,SZ,n*2):
        nums1 = draw[i:i+n]
        nums2 = draw[i+n:i+2*n]
        t, prob = stats.t_two_sample(nums1,nums2)
        if prob < 0.05:  counter += 1
    return 1.0*counter / N

Now, we run the trial a bunch of times. For this run, I've chosen the sample size to be 3.

n = 3             # sample size
L = list()
for i in range(100):
    #if i and not i % 10:  print 'i =', i
    L.append(oneTrial(n))

print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),
print 'min %3.4f, max %3.4f' % (min(L), max(L))

Output:

avg 0.0511, std 0.0095 min 0.0280, max 0.0780