Python for Bioinformatics: Student's t-test again 1

Thursday, November 4, 2010

Student's t-test again 1

Periodically, I've wondered about whether anyone reads these posts. In a meta sort of way, I've even posted about it.

As ABOUT ME says, a major objective here is to provide a record so that "google can organize my head." (bbum now says "index"---well, OK)

Still, one wonders. And I have to admit that I'm NOT repeat NOT an expert on most of these topics. I'm just a student, and a true believer in Python. Although I'm sure Ruby would satisfy me as well (if I hadn't met Python first), but I think the possible advantage would be marginal.

And, please, if you find an error here (or even something interesting), speak up.

So I was excited to discover this AM that blogger has stats. You can't see my stats, but I can. And the stats say that 166 people loaded pages from my blog today. 166 != 0. That's great. A perennial winner is a post about t-tests.

That smells like homework. Nevertheless, it put a thought in my tiny brain, and it won't go away. So I started thinking about the t-test: the what, the why, the how, and the Python.

Note: I posted about the t-test in R before (here, here and here).

Anyway, I'm going to look into the subject a bit more seriously. But for starters, let's model the basic problem. If we sample repeatedly from a population of known mean and variance, then the accuracy of the observed means (the standard error of the mean) depends markedly on sample size. The normal distribution of the set of means is guaranteed by the miraculous Central Limit Theorem (here, and wikipedia).

The t-test deals with samples of small size, which are ubiquitous in classical biology. And even in microarray studies, if you consider the data gene-by-gene. Of course there is more to it, for example, whether the size of the samples is the same, and their individual variances.

Here is a simple Python simulation that shows the problem. The first set of samples is from a normal distribution, the second from a uniform distribution. You can clearly see the dependence of the standard deviation of the sample means on the sample size. The column labeled stdev is the computed standard deviation for the observed means, and the column labeled sem is the predicted standard deviation based on the population standard deviation and n:
sem = σ / √n.

num     mean     stdev       sem
 2     10.09      3.56      3.54
 3      9.99      2.86      2.89
 4     10.03      2.53      2.50
 5      9.97      2.23      2.24
10     10.01      1.58      1.58
15     10.02      1.29      1.29
20     10.02      1.11      1.12
25     10.01      1.00      1.00
30     10.00      0.91      0.91

 2     10.01      4.08
 3     10.02      3.33
 4     10.00      2.90
 5     10.01      2.57
10      9.99      1.80
15     10.00      1.48
20      9.99      1.28
25     10.00      1.16
30     10.01      1.06

Code listing:

import numpy as np
import random, math

N = int(1e6)
mu = 10
sigma = 5
A_normal = np.random.normal(loc=mu,scale=sigma,size=N)
A_uniform = np.random.uniform(low=0,high=mu*2,size=N)

def sample(A,SZ):
    rL = list()
    for j in range(10000):
        L = [random.choice(A) for k in range(SZ)]
        rL.append(np.mean(L))
    return (np.mean(rL),np.std(rL))

def show(A,extra=True):
    R = range(2,6) + range(10,31,5)
    for SZ in R:
        t = sample(A,SZ)
        pL = ['%2d' % SZ,
            '%5.2f' % t[0],
            '%5.2f' % t[1]]
        if extra:
            pL.append('%5.2f' % (sigma/math.sqrt(SZ)))
        print (' '*5).join(pL)
    print

print 'num     mean     stdev       sem'
show(A_normal)
show(A_uniform,extra=False)