Python for Bioinformatics: A kick in the pants?

Thursday, April 1, 2010

A kick in the pants?

I worked up another problem from Grinstead & Snell, but it's going to be the last. I don't want to give away all the answers., although I may post about content from the later chapters as I get it figured out. The problem for this post involves the classic example of the Poisson distribution.

"In one of the first studies of the Poisson distribution, von Bortkiewicz considered the frequency of deaths from kicks in the Prussian army corps. From the study of 14 corps over a 20-year period, he obtained the data shown in [the] Table. Fit a Poisson distribution to this data and see if you think that the Poisson distribution is appropriate."

deaths number of corps with x deaths
0 144
1  91
2  32
3  11
4   2

We can think of the results for each corps over a single year as a Bernoulli trial with n very large and p very small and the mean near 1. We calculate the mean as (91 + 2*32 + 3*11 + 4*2) / 280 = 0.7. So λ = 0.7.

The code below prints the values for P(0), P(1) etc, as well as the calculated probability given this λ, and the expected and observed values for the number of corps with x deaths. The output is:

$ python kicks.py 
k       p      E     O
0   0.497 139.04   144
1   0.348  97.33    91
2   0.122  34.07    32
3   0.028   7.95    11
4   0.005   1.39     2

I think the model fits the data pretty well. However, I would like to know the appropriate statistical test for the fit.

import math
import matplotlib.pyplot as plt

def poisson(m):
    def f(k):
        e = math.e**(-m)
        f = math.factorial(k)
        g = m**k
        return g*e/f
    return f
    
p = poisson(0.7)
data = [144,91,32,11,2]
print 'k       p      E     O'
for i in range (5):
    print str(i).ljust(3),
    prob = p(i)
    print str(round(prob,3)).ljust(5), 
    print str(round(prob*280,2)).rjust(6),
    print str(data[i]).rjust(5)