## Sunday, January 31, 2010

### Two sample t-test in Python

[UPDATE: I'm currently (November. 2010) in the middle of a series of posts about Student's t test. The first six posts are here, here, here, here, here, and here.]

I'd like to do my teaching in Python as much as possible. In keeping with that, I plan to explore methods for analyzing microarray data using Python rather than R and Bioconductor. As a preliminary matter, I'm going to use Student's t-test (two sample) as implemented in PyCogent. A simple-minded approach would be to a situation where the samples have already been classified, would be to filter them by the t-test statistic.

Here is a simple test. We draw a bunch of numbers from the normal distribution and keep them in a numpy array. We slide a window along the array, pulling out two small sets of n numbers at a time, and do a two sample t-test on them. We record the p-value. We expect that since all the numbers come from the same normal distribution the distribution of the p-value will be such that, on the average, p < 0.05 about 5% of the time.

 `# two sample t-testsimport numpy as npimport cogent.maths.stats.test as statsdef oneTrial(n,f=np.random.normal): N = 500 # num of individual tests SZ = n*2*N # need this many nums draw = f(loc=50,scale=3,size=SZ) counter = 0 for i in range(0,SZ,n*2): nums1 = draw[i:i+n] nums2 = draw[i+n:i+2*n] t, prob = stats.t_two_sample(nums1,nums2) if prob < 0.05: counter += 1 return 1.0*counter / N`

Now, we run the trial a bunch of times. For this run, I've chosen the sample size to be 3.

 `n = 3 # sample sizeL = list()for i in range(100): #if i and not i % 10: print 'i =', i L.append(oneTrial(n))print 'avg %3.4f, std %3.4f' % (np.mean(L), np.std(L)),print 'min %3.4f, max %3.4f' % (min(L), max(L))`

Output:

 `avg 0.0511, std 0.0095 min 0.0280, max 0.0780`