Thursday, May 8, 2008

Python for simulations (2)

The Central Limit Theorem is supposed to be very important for Bioinformatics. The Deonier book, for example, discusses it starting on p. 79 and even gives a proof. My contention is that it is unnecessary for a biologist working in bioinformatics to understand this proof. It would be nice but it is an unnecessary distraction. We are like engineers and not mathematicians. We should know that one can prove, but not necessarily know how to prove that the maxima and minima of a curve occur when the slope is zero, we just want to know how to calculate these values. It's like Treasure of the Sierra Madre, when the bandit says: 'lemmas, we don't need no stinking lemmas.' Or something. But what is nice, is to show a demonstration that the Central Limit Theorem predicts results that are borne out by experiment. In short, we need more simulations.

Here is a Python script that generates a list or vector of numbers from five distributions: uniform (0 to 1), normal (mean = 0, stdev = 2), and three beta distributions. One of these has alpha, beta = 5, 1; the others are 2, 5 and 0.5, 0.5. We generate 2E5 numbers from each distribution and save them to disk (14.3 MB file). We then take groups of 50 from each distribution and sum them, and save those results to disk in a separate file. We use R for plotting. I didn't time them but Python and R each take about 10 seconds on my machine.

It looks like the mathematicians may be on to something: