Python for Bioinformatics: Approaching Normal

This post continues with the theme of exploring the normal distribution. I wanted to take a look at how fast repeated samples from a uniform distribution approach the normal, and in the process exercise my R skills a bit. If you have not used R, I recommend it for graphics and exploring statistics. It works well (and looks beautiful) on my Mac. In the course I taught in Spring, there was a student who had trouble installing it under Windows, but that machine was hosed anyway. Disclaimer: I am not an R guru. If I'm not "doing it right", let me know.

The first thing is to model rolling a standard six-sided die. We want to sample from the uniform distribution of integers between 1 and 6. Don't forget to sample with replacement. As usual, I use yellow background for my code, and blue background to show what the program prints to the screen.

d = 1:6
u = sample(d,10000,replace=T)
mean(u)
var(u)

As we expect, the mean or expected value is 1/6 * (1 + 2 + 3 + 4 + 5 + 6) = 21/6 = 3.5 and the variance is 1/6 * 2 * (0.5² + 1.5² + 2.5²) = 1/3 * (0.25 + 2.25 + 6.25) = 1/3 * 8.75 = 2.92, and the (population) standard deviation is √2.92 = 1.71.

> mean(u)
[1] 3.4786
> var(u)
[1] 2.917434
> sd(u)
[1] 1.708050
> summary(u)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   3.479   5.000   6.000

We will take a look at the distribution of the numbers using the hist function. There are a couple of details to consider when using hist. (I often have to remind myself about its arguments by doing "?hist"). One is the argument "breaks"

breaks  one of:
a vector giving the breakpoints between histogram cells,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the 
  number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.

I often specify the number of cells explicitly, especially when I want to compare multiple plots. Since I had a little trouble with this plot, I tried a couple of different things, and I want to plot them all in the same window. For this, I use the formatting command "par" and tell it to make a set of plots in 1 row and 3 columns.

par(mfrow=c(1,3))
hist(u,breaks=6,col='blue')
hist(u,right=F,
  breaks=6,col='blue')
hist(u-0.5,breaks=6,col='blue')

As you can see, the first two histograms look funny. What is going on is that R is trying to bin the numbers in the vector with breakpoints exactly on the integer values 1, 2, 3... Since the numbers are themselves drawn from 1, 2, 3... it has to go right or left, and at the boundaries of the plot it looks weird. One argument controlling how this works is:

right  logical; if TRUE, the histograms cells are right-closed
   (left open) intervals.

The default setting is "TRUE", which has cells formed as "right-closed (left-open) intervals"---whatever that means. The result is that at the left boundary all the values "1" and "2" have been binned together. Neither setting for "right" is what we want. My solution was to shift all the values to the left by 0.5 and then plot the result. The "1"s are plotted in the cell between 0 and 1.

Now, let's see what happens if we roll the dice again. What we're going to do is start with a vector of the size we need called z, that is filled with zeroes. (It has a length of 50,000). We need to initialize it because we will be updating at each round. We roll the dice six times and plot the results at each stage.

As you can see, we already have a reasonable approximation to the normal after summing just two numbers drawn at random from the "sample space" of 1, 2... , 6. And by n=6 the approximation is very good. Since the standard deviation goes as √variance, and the variances add, by n=6 we have a range of 31 (from 6 to 36) but the sd is only √(6*2.91) = 4.18 (4.19 in the figure).

Here is the code:

d = 1:6
par(mfrow=c(3,2))
z = rep(0,50000)
for (i in 1:6) {
  B = i*6
  z = z + sample(d,50000,replace=T)
  w = z+0.5
  hist(w,breaks=B,
    xlim=c(min(w)-2,max(w)+2),
    col='gray90',freq=F,
    main=paste('sd = ',
      round(sd(w),2),sep=""),
    xlab=paste('round',i))
  plot(function(x) dnorm(x,mean(w),sd(w)),
    0,max(w),lwd=2,add=T,col='red')
  }

I found the code for the function that plots the normal distribution with the same mean and sd as our samples somewhere on the web. It is doing something a bit funny. I think it is making an "anonymous" (i.e. unnamed) function to feed to plot, and then applying that function with bounds = 0 and max(w), but I'm not real clear about how this works. The parameter lwd is for line width.

Finally, let's look a little more closely at the z vector after 6 rounds. We plot it in the same histogram as the normal density of the same mean and sd, switching which one is in back as we move from the left panel to the right.

The code:

mean(z)
sd(z)
summary(z)
x=rnorm(length(z),mean(z),sd(z))
summary(x)

par(mfrow=c(1,2))
hist(z,breaks=40,freq=F,
  xlim=c(0,40))
hist(x,col='gray80',breaks=40,
  freq=F,add=T)

hist(x,col='gray80',breaks=40,
  freq=F,xlim=c(0,40))
hist(z,col='white',
  breaks=40,freq=F,add=T)

Notice, however, that the correspondence is not perfect, as shown by the results of calling the summary function.

> mean(z)
[1] 21.00562
> sd(z)
[1] 4.155018
> summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.00   18.00   21.00   21.01   24.00   36.00 
> x=rnorm(length(z),mean(z),sd(z))
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.297  18.210  20.980  20.990  23.780  38.350

Our vector z seems a little fatter at the first and third quartiles and yet its minimum and maximum values are closer to the mean than the true normal distribution.

Wednesday, July 29, 2009

Approaching Normal