Saturday, April 21, 2012

Testing

It seems improbable, but it is a fact that on the first exam in some courses, a few students may score lower than expected for guessing at random. We usually see some scores of less than 20 out of 100 for an exam whose questions have five possible answers ABCDE.

We can turn this into an illustration of statistical significance. How high must a score be before one can dismiss the null hypothesis (no knowledge of the subject) with a p-value of 0.05?

We have a Bernouli trial with p = 0.2.

The expected value or mean is np. There is a beautifully simple derivation of the mean and variance in the wikipedia article, the variance is np(1-p). For our example, the mean is 20 and the variance is 100 x 0.2 x 0.8, so the standard deviation is √16 = 4.

For a large class (N = 100) of randomly answering students, the distribution appraches normal, and we expect that 2.5% of such students will score at or above two standard deviations above the mean. To make our rate for a type I error less than 0.05, we should require a score of 28 or higher.

I wrote a short script to simulate this. Each student takes a test with 100 questions, and each trial consists of a class of 100 students. We count the number of students with scores at or above a threshold T. The results for 10 trials are printed, and the results for 100 trials are averaged.

A few quick points: the mean appears to fall between 20 and 21. I'm sure there's a simple explanation but it eludes me at the moment. The score that 2.5% of students regularly exceed is 28.

This is a great example of the danger of multiple hypothesis testing!

One last thing is that the code is not very smart. What I should have done was extract the statistics for each value of T for each trial, rather than repeating the trial for each value of T. But it runs fast enough, and this way I don't need a data structure to remember the results.

> python script.py 
20:  56 43 54 57 57 54 57 46 45 64 55.0
21:  39 48 42 44 39 44 48 48 50 48 43.5
22:  43 32 39 30 38 35 36 42 33 41 34.6
23:  40 29 35 31 27 19 30 20 31 25 26.0
24:  19 18 24 26 20 22 21 17 15 24 19.2
25:  11 14  8  8 13 11 12 15 17 13 12.9
26:  12  6  6  6 11  4  8  7 10  6  9.2
27:   7  4  6  5  8  7  5  7  7  7  5.9
28:   3  2  4  2  3  1  2  3  5  1  3.6
29:   4  4  2  2  2  1  3  1  0  1  2.2
30:   1  1  1  0  1  1  1  0  0  1  0.9
31:   4  0  1  0  0  0  0  0  0  0  0.6
32:   0  0  0  0  1  0  0  1  0  0  0.3
33:   0  0  0  0  0  1  0  0  0  1  0.1
34:   0  0  0  0  0  0  0  0  1  0  0.1

script.py
import random
    
def student(N=100, p=0.2):
    L = [random.random() <= p for i in range(N)]
    return sum(L)

def trial(T):
    L = [student() for i in range(100)]
    return sum([1 for n in L if n >= T])
    
def mean(L):
    return 1.0*sum(L)/len(L)

for T in range(20,35):
    L = list()
    print '%2d: ' % T,
    for i in range(100):
        L.append(trial(T))
        if i < 10:
            print '%2d' % L[-1],
    print '%4.1f' % mean(L)