Python for Bioinformatics: Grade disparity

Friday, August 14, 2009

Grade disparity - chi-squared analysis

I recently posted about two common statistical problems. The first is estimating the population mean when the sample size is small, which involves the t statistic of "Student", together with the related problem of deciding whether the means of two small samples differ, including the case of paired samples. These arise commonly in biology. The second is deciding whether two sets of paired values are independent, or whether, given the value of x, we can predict y. This problem involves covariance, and the simplest approach is linear regression.

The third basic problem in statistics for which I want to improve my understanding is that of the independence of two discrete distributions. This will lead us to the chi-squared density as a special case of the gamma distribution.

There is a very nice discussion of this in Grinstead & Snell (pdf). Here is their table which gives a grade distribution, comparing the results for females and males in a particular class.

If the distributions are independent (e.g. P(grade = A) = P(grade = A | sex = female)), then we can predict the most likely distribution, but we also expect some spread in the observed breakdown by grade and sex due to random sampling. How much deviation should we expect?
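
To make "the most likely distribution" concrete, here is a minimal sketch of the expected counts under independence, using only the totals quoted in this post (152 females, 167 males, and 93, 123, 90 and 13 students in the four grade categories). The dictionary names are my own, chosen for illustration.

```python
# Marginal totals taken from the text: 152 females, 167 males,
# and 93 'A's, 123 'B's, 90 'C's, 13 'D's.
sex_totals = {'F': 152, 'M': 167}
grade_totals = {'A': 93, 'B': 123, 'C': 90, 'D': 13}
N = sum(sex_totals.values())  # total number of students

# Under independence: E[sex, grade] = (sex total) * (grade total) / N
expected = {
    (s, g): sex_totals[s] * grade_totals[g] / N
    for s in sex_totals
    for g in grade_totals
}

# Print one row of expected counts per sex
for s in sex_totals:
    row = ' '.join('%6.2f' % expected[(s, g)] for g in grade_totals)
    print(s, row)
```

The expected counts are generally not whole numbers, and they add up to the same row and column totals as the observed table.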

We can model the problem using colored balls. For example, we put 152 pink balls and 167 baby blue balls into an urn, mix well, and then draw out in succession 93 'A's, 123 'B's, 90 'C's and 13 'D's.

We calculate for each category:

(O - E)²/ E

where O is the observed value and E the expected value based on the assumption of independence. We sum this quantity over all categories to obtain the statistic. According to theory, this statistic has approximately a chi-squared (Χ²) distribution with degrees of freedom df = 3. (For this problem, df = (number of grades - 1) × (number of sexes - 1) = (4 - 1) × (2 - 1) = 3.)
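
The recipe can be sketched as a small function that takes a table of observed counts and returns the statistic together with df. One loud assumption: the cell counts below are not given in the text of this post. They are recalled from the Grinstead & Snell table and should be checked against the original. They do agree with the row and column totals used in the simulation (152 and 167; 93, 123, 90 and 13). The function name chi_squared is my own.

```python
def chi_squared(table):
    """Return (statistic, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # expected count under independence
            expected_count = row_totals[i] * col_totals[j] / n
            stat += (observed_count - expected_count) ** 2 / expected_count
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# ASSUMED cell counts (rows: female, male; columns: A, B, C, D),
# recalled from Grinstead & Snell rather than taken from this post.
observed = [
    [37, 63, 47, 5],
    [56, 60, 43, 8],
]
stat, df = chi_squared(observed)
print(round(stat, 2), df)
```

With these counts the statistic comes out at about 4.13 with df = 3.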

If the calculated value for the statistic exceeds 95 % of the values in the distribution for this df, we reject the null hypothesis that the two probability distributions are independent. In other words, we suspect that males and females have different grade distributions for this course.

For this data, the value of Χ² that is exceeded 5 % of the time is 7.8, so the calculated value of 4.13 does not allow rejection.
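
The cutoff of 7.8 can be checked without a statistics package. For df = 3 the chi-squared cumulative distribution has a closed form in terms of the error function, so a minimal sketch only needs the standard library. The function names and the bisection bounds are my own choices, and the closed form is specific to df = 3.

```python
import math

def chi2_cdf_df3(x):
    """CDF of the chi-squared distribution with 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return (math.erf(math.sqrt(x / 2))
            - math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

def critical_value(p=0.95, lo=0.0, hi=50.0):
    """Bisect for the x at which the CDF reaches p."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if chi2_cdf_df3(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# value exceeded 5 % of the time
print(round(critical_value(), 2))
# probability of a statistic at least as large as 4.13
print(round(1 - chi2_cdf_df3(4.13), 3))
```

The second number is the tail probability for the observed statistic. It comes out at roughly one in four, far above 0.05, which is another way of saying that 4.13 does not allow rejection.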

Here is code for a Python simulation (R code to plot is at the end). I redirect the output to a file like so:

python grades.py > results.txt

import random

F = 152;  M = 167   # number of females, males
N = M + F
f = F/N;  m = M/N   # fraction of females, males
v = False           # verbose: set to True to print each trial's table

grades = [93,123,90,13]
# expected values:
EF = [f*g for g in grades]
EM = [m*g for g in grades]

def test():
    # print the expected counts for females and males
    print('EF', ' '.join(str(round(e,2)) for e in EF))
    print('EM', ' '.join(str(round(e,2)) for e in EM))

R = 10000 # number of trials
for trial in range(R):
    if v:  test()
    chisq = 0
    pL = ['F']*F + ['M']*M
    random.shuffle(pL)
    mL = list()
    fL = list()
    for i in range(4):  # grades A-D
        n = grades[i]   # how many 'A's...
        fL.append(pL[:n].count('F'))
        mL.append(pL[:n].count('M'))
        pL = pL[n:]

    if v:  print('OF',' '.join([str(e).rjust(2) for e in fL]))
    if v:  print('OM',' '.join([str(e).rjust(2) for e in mL]))
    # sum (O - E)^2 / E over all eight cells
    for i in range(4):
        chisq += (fL[i]-EF[i])**2/EF[i]
        chisq += (mL[i]-EM[i])**2/EM[i]
    print(round(chisq,2))
    if v:  print()

R code to plot the results:

setwd('Desktop')
v = read.table('results.txt', header=FALSE)
B = seq(0,20,by=0.25)
hist(v[,1],freq=FALSE,breaks=B)
curve(dchisq(x,3),from=0,to=20,add=TRUE,col='blue',lwd=3)
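
As a numerical companion to the histogram, the same urn simulation can count how often the simulated statistic exceeds 7.8. If the chi-squared approximation with df = 3 is good, that fraction should be close to 5 %. This is a minimal sketch, not the script above: the function name, the fixed seed and the smaller number of trials are my own choices.

```python
import random

def simulate_chisq(n_trials=5000, seed=1):
    """Urn simulation of the chi-squared statistic under independence."""
    rng = random.Random(seed)
    F, M = 152, 167
    grades = [93, 123, 90, 13]
    N = F + M
    people = ['F'] * F + ['M'] * M
    values = []
    for _ in range(n_trials):
        rng.shuffle(people)
        start = 0
        chisq = 0.0
        for g in grades:
            # observed females and males drawn for this grade
            n_f = people[start:start + g].count('F')
            n_m = g - n_f
            # expected counts under independence
            e_f = g * F / N
            e_m = g * M / N
            chisq += (n_f - e_f) ** 2 / e_f + (n_m - e_m) ** 2 / e_m
            start += g
        values.append(chisq)
    return values

values = simulate_chisq()
tail = sum(v > 7.8 for v in values) / len(values)
print('fraction above 7.8: %.3f' % tail)
print('mean of statistic: %.2f' % (sum(values) / len(values)))
```

The mean of the simulated values should also sit near 3, since the mean of a chi-squared distribution equals its degrees of freedom.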