Python for Bioinformatics: Student's t-test again 5

Monday, November 8, 2010

Student's t-test again 5

Code

This is the fifth in a series of posts about Student's t test. The previous posts are here , here, here, and here.

Last time I showed a version of the one-sample t-test in Python. As we saw, the t-statistic is constructed similar to a standard error (like the sem). It is something like a z-score using the expected value for the population mean, and the observed (unbiased) sample standard deviation, then adjusted by multiplying this "z-score" by √n. (The same function, slightly edited, is also given below).

The paired t-test is simply a one-sample t-test done on the differences between paired values, comparing them to the expected difference rather than the expected mean. The paired values arise, for example, in repeated tests of the same individuals. The idea is that variance due to differences between individuals is the same for each value in a single pair.

The two-sample t-test takes account of the expected difference in means:

diff = (np.mean(A) - np.mean(B) - expected_diff)

The code I copied (and massaged) takes the observed sample variance and turns it back into the sum of squares:

sum_sq = (var(A)*(na-1) + var(B)*(nb-1))

There's a super-duper factor which is constructed from the number of values in each sample (and the dependent degrees of freedom):

na = len(A)
nb = len(B)
df = na + nb - 2
f = (1/na + 1/nb)/df
t = diff/np.sqrt(sum_sq*f)

The t-statistic is as shown. I offer no justification for these procedures. (If you want more theory and don't like your book, start here: pdf). What we will do is test the output from these functions. Today we simply check that R gets the same t-statistic for one (or two) examples. Next time we will construct a test harness that does more. And after that we may do some more testing that shows these functions do give the correct distribution with simulated data.

I feel like I do understand the tests better after this project.

The functions for unbiased variance and standard deviation in utils.py were given last time. There's something a bit unusual about the variance calculation, which comes from PyCogent. I'll try to remember to deal with that when I talk about error checking, and "tails", in a future post. The functions here are just the basic t-statistic calculations themselves.

The source module for the t distribution:

from transcendental import stdtr

was explained here.

Here is the output if the module is run as "__main__":

t-statistic  p-value
-2.378         0.049
-6.455         0.004
-2.611         0.040
-2.574         0.021

R output using the code as given at the end of the listing (and as shown here):

> A = c(3.1,2.3,2.1,1.7)
> result = t.test(A, alternative='less',mu=3)
> result$statistic
        t 
-2.377782 
> B = c(2.1,1.8,2.7,2.4)
> result=t.test(B, alternative='less',mu=3.5)
> result$statistic
        t 
-6.454972 
> C = c(3.1,4.3,4.1,2.7)
> result=t.test(A,C,alternative='less',
+   paired=TRUE,var.equal=FALSE)
> result$statistic
        t 
-2.611165 
> result=t.test(A,C,alternative='less',var.equal=TRUE)
> result$statistic
        t 
-2.573993 
>

code listing:

from __future__ import division
import numpy as np
from transcendental import stdtr
from utils import unbiased_var as var
from utils import unbiased_std as std

def one_sample_t(A,mu):
    n = len(A)
    df = n-1
    z = (np.mean(A) - mu) / std(A)
    t = z * np.sqrt(n)
    return t, stdtr(df,t)

def paired_t(A,B,expected_diff=0):
    return one_sample_t(A - B,expected_diff)

def two_sample_t(A,B,expected_diff=0):
    diff = (np.mean(A) - np.mean(B) - expected_diff)
    na = len(A)
    nb = len(B)
    df = na + nb - 2
    sum_sq = (var(A)*(na-1) + var(B)*(nb-1))
    f = (1/na + 1/nb)/df
    t = diff/np.sqrt(sum_sq*f)  
    return (t, stdtr(df,t))
#==========================================
def test(func,D):
    if func == one_sample_t:
        result = one_sample_t(D['X'],D['mu'])
    elif func == paired_t:
        result = paired_t(D['X'],D['Y'])
    elif func == two_sample_t:
        result = two_sample_t(D['X'],D['Y'])
    if D['verbose']:  
        print '%5.3f         %5.3f' % result
    return result
    
if __name__ == '__main__':
    print 't-statistic  p-value'
    
    A = np.array([3.1,2.3,2.1,1.7])
    B = np.array([2.1,1.8,2.7,2.4])
    test(one_sample_t,{'X':A,'mu':3,'verbose':True})
    test(one_sample_t,{'X':B,'mu':3.5,'verbose':True})
    
    C = np.array([3.1,4.3,4.1,2.7])
    test(paired_t,{'X':A,'Y':C,'verbose':True})
    test(two_sample_t,{'X':A,'Y':C,'verbose':True})

'''
R code:
A = c(3.1,2.3,2.1,1.7)
result = t.test(A, alternative='less',mu=3)
result$statistic
B = c(2.1,1.8,2.7,2.4)
result=t.test(B, alternative='less',mu=3.5)
result$statistic
C = c(3.1,4.3,4.1,2.7)
result=t.test(A,C,alternative='less',
  paired=TRUE,var.equal=FALSE)
result$statistic
result=t.test(A,C,alternative='less',var.equal=TRUE)
result$statistic
'''