Python for Bioinformatics: Chi-squared

Tuesday, August 4, 2009

Chi-squared

            handed
         ------------
         right   left   Total
female    RF      LF
  male    RM      LM
 Total

For any values (RF,LF,RM,LM) we calculate the marginal totals and then we can ask, how surprising is the observed distribution given the marginal distributions?

Suppose the distribution is that we have 60 females and 80 right-handers, out of a total of 100 subjects. (Actually, the incidence of left-handers is about half) this much.

If hand bias and sex are not associated, what distribution of distributions do we expect by chance? Let's do a simulation in Python. Here are the results from 50,000 reps.

      mean      sd      95% CI

RF    48.0    1.97   44.14   51.85
LF    12.0    1.97    8.15   15.86
RM    32.0    1.97   28.15   35.86
LM     8.0    1.97    4.14   11.85

I was quite surprised by this at first. On reflection, I guess it is implicit in the whole idea of the chi-square test, but it just completely passed me by that the standard deviation for each subgroup is necessarily the same.

It is connected to the fact that, as they say, there is only one degree of freedom. If our group of 60 females happens to have not 48 but 53 right-handers, then the rest is all determined. There will be 7 left-handed females, and for the 40 males there will be not 32 right-handers but only 27, and not 8 left-handers but 13. Each group will be shifted from its mean value by the same amount. So the total amount of "shifting" or variation from the mean, is the same for each group. And of course the sum of the square of this difference for the whole series of trials will also be the same. Huh!

import random,math
random.seed(157)
RF = list()
LF = list()
RM = list()
LM = list()
LL = [RF,LF,RM,LM]
nameL = ['RF','LF','RM','LM']

def initOne():
    R = 80;  F = 60;  N = 100
    bias = list('R'*R + 'L'*(N-R))
    sex =  list('F'*F + 'M'*(N-F))
    random.shuffle(sex)
    random.shuffle(bias)
    return bias,sex

for i in range(50000):
    bias,sex = initOne()
    L = list()
    while sex:
        L.append(bias.pop() + sex.pop())
    for i,case in enumerate(nameL):
        LL[i].append(L.count(case))

for i,L in enumerate(LL):
    #print L
    print nameL[i] + '   ',
    
    n = len(L)
    m = sum(L)*1.0/n
    sumsq = sum([(e-m)**2 for e in L])
    sd = math.sqrt(sumsq/n)

    print str(m).rjust(4), 
    print str(round(sd,2)).rjust(7), 
    print str(round(m-1.96*sd,2)).rjust(7), 
    print str(round(m+1.96*sd,2)).rjust(7)