Python for Bioinformatics: Student's t-test again 4

Friday, November 5, 2010

Student's t-test again 4

One Sample t-test

This is the fourth in a series of posts about Student's t test. The previous posts are here , here and here.

There are several different versions of the t-test, but the simplest is the one sample test. We will have a single sample with a (relatively) small number of values. We calculate the mean and the variance and then the standard deviation of the sample values. Importantly, the variance and standard deviation are the unbiased versions, in which the sum of squares is divided by n-1 rather than n.

It's easy to see that we should not use numpy's var for this test:

>>> import numpy as np
>>> A = [1,2,3]
>>> np.var(A)
0.66666666666666663

As help(np.var) indicates:

    Notes
    -----
    The variance is the average of the squared deviations from the mean,
    i.e.,  var = mean(abs(x - x.mean())**2).  The computed variance is biased,
    i.e., the mean is computed by dividing by the number of elements, N,
    rather than by N-1.

In the code below we do:

    z = np.mean(A) - mu 
    z /= unbiased_std(A)

So, at least conceptually, at this point we have a z-score for the sample mean. With a standard z-score, we would compare it to the normal distribution. Here we do two things differently: we multiply by √n (the square root of the number of samples), and we find the value of the resulting t statistic in the t distribution.

t-statistic  p-value
-2.378         0.049

The result indicates that, given a prior choice of a limit of 0.05 (for this one-sided test), the null hypothesis that the mean of the sample values is equal to 3 is not supported (just barely).

You can read much more about the background of the test here.

The second set of values in the code below gives:

t-statistic  p-value
-6.455         0.004

It's reassuring that a One Sample t-test in R gives a similar result. I'll have more to say about "tails" and sidedness in another post.

> A = c(3.1,2.3,2.1,1.7)
> t.test(A, alternative='less',mu=3)

 One Sample t-test

data:  A 
t = -2.3778, df = 3, p-value = 0.04891
alternative hypothesis: true mean is less than 3 
95 percent confidence interval:
     -Inf 2.992811 
sample estimates:
mean of x 
      2.3 

> B = c(2.1,1.8,2.7,2.4)
> t.test(B, alternative='less',mu=3.5)

 One Sample t-test

data:  B 
t = -6.455, df = 3, p-value = 0.003771
alternative hypothesis: true mean is less than 3.5 
95 percent confidence interval:
     -Inf 2.705727 
sample estimates:
mean of x 
     2.25 

>

Obviously, we could use a lot more testing.

code listing:

from __future__ import division
import numpy as np
from transcendental import stdtr
from utils import unbiased_std

def one_sample_t(A,mu):
    n = len(A)
    df = n - 1
    z = np.mean(A) - mu 
    z /= unbiased_std(A)
    t = z * np.sqrt(n)
    return t, stdtr(df,t)

def test(A,mu):
    result = one_sample_t(A,mu)
    print 't-statistic  p-value'
    print '%5.3f         %5.3f' % result
    
if __name__ == '__main__':
    A = np.array([3.1,2.3,2.1,1.7])
    B = np.array([2.1,1.8,2.7,2.4])
    test(A,mu=3)
    test(B,mu=3.5)

'''
R code:
A = c(3.1,2.3,2.1,1.7)
t.test(A, alternative='less',mu=3)
B = c(2.1,1.8,2.7,2.4)
t.test(B, alternative='less',mu=3.5)
'''

utils.py

from __future__ import division
import numpy as np

def unbiased_var(X):
    n = len(X)
    sample_SS = sum(X**2) - sum(X)**2 / n
    return sample_SS/ (n-1)
    
def unbiased_std(X):
    return np.sqrt(unbiased_var(X))