Python for Bioinformatics: The normal (Gaussian) distribution

Monday, July 27, 2009

The normal (Gaussian) distribution

I'm not very good at proofs, but I wanted to try to understand where the normal distribution comes from. In fact, we saw in an earlier post that we can show by simulation that the Central Limit Theorem seems to be correct. Regardless of the underlying distribution (as long as its variance is finite), the sample mean x̄ is approximately normally distributed if the sample size is sufficiently large.
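A simulation along these lines might look like the following minimal sketch (it is not the code from the earlier post). The Exp(1) source distribution, the sample size of 50, the number of samples, and the seed are all arbitrary choices made here for illustration.

```python
import random
import statistics

def sample_means(n_samples, sample_size, rng):
    """Return n_samples means, each computed from sample_size draws of Exp(1)."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(2000, 50, rng)

# Exp(1) has mean 1 and variance 1, so the sample means should cluster
# near 1 with standard deviation close to 1 / sqrt(50).
m = statistics.fmean(means)
s = statistics.stdev(means)
print(m, s)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean.
within = sum(abs(x - m) <= s for x in means) / len(means)
print(within)
```

Even though the exponential distribution is strongly skewed, the fraction of sample means within one standard deviation should come out close to the normal value.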

However, let's try this argument, which is originally due to Sir John F. W. Herschel.

Imagine that you are throwing darts at the origin of the x,y plane. Under perfect conditions, you would hit the center dead on every time. However, conditions aren't perfect. The wind is gusting, the music is loud, your blood alcohol is modestly elevated, and there are other distractions. As a result, small errors creep in and the pattern over time looks like so:

The R code:

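# 1000 darts: independent standard normal errors in x and y
x=rnorm(1000)
y=rnorm(1000)
# common limits for both axes
L=c(-3,3)
plot(x,y,pch=16,xlim=L,ylim=L,col='blue')
# dashed lines through the origin (the target)
lines(c(0,0),c(-3,3),lty=2,lwd=2)
lines(c(-3,3),c(0,0),lty=2,lwd=2)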

Now, there is some unknown function for the probability that a dart will land in the interval between x and x + ∆x. The probability depends on x, with a maximum at x = 0 and then decreasing toward zero as |x| gets large. We designate that function as a probability density function p(x). For a small interval, the density times the width of the interval approximates the probability that the dart lands in the interval:

Prob ≈ p(x) ∆x
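To see how good the density-times-width approximation is, here is a small sketch in Python. At this point in the argument p is still unknown, so the standard normal density is used purely as an assumed stand-in; the point x = 1.0 and the widths are arbitrary.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # assumed stand-in for the unknown p

def exact_prob(x, dx):
    """Exact probability of landing between x and x + dx."""
    return dist.cdf(x + dx) - dist.cdf(x)

def approx_prob(x, dx):
    """Density at x times the width of the interval."""
    return dist.pdf(x) * dx

for dx in (0.5, 0.1, 0.01):
    print(dx, exact_prob(1.0, dx), approx_prob(1.0, dx))
```

The two columns should move closer together as ∆x shrinks, which is all the argument needs.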

Now we consider a small area of size ∆x∆y. If:

the errors in perpendicular directions are independent

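then the probabilities for the two directions multiply. Taking the same function p for both directions (so p(y) is just p evaluated at y), we can get the probability that a dart lands in the small rectangle with corners (x, y) and (x + ∆x, y + ∆y) as: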

Prob = p(x)∆x p(y)∆y
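The product rule for independent errors can be checked by simulation. The sketch below assumes independent standard normal errors in x and y (as in the R plot above); the rectangle, the number of darts, and the seed are arbitrary choices.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 200_000

# Assumption for illustration: independent standard normal errors in x and y.
darts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]

def fraction(points, keep):
    """Fraction of points for which keep(x, y) is true."""
    return sum(1 for x, y in points if keep(x, y)) / len(points)

# A small rectangle with corner (0.5, -0.2), width 0.2 and height 0.2.
x0, y0, dx, dy = 0.5, -0.2, 0.2, 0.2
in_x = fraction(darts, lambda x, y: x0 <= x < x0 + dx)  # vertical strip
in_y = fraction(darts, lambda x, y: y0 <= y < y0 + dy)  # horizontal strip
in_both = fraction(
    darts, lambda x, y: x0 <= x < x0 + dx and y0 <= y < y0 + dy
)

# The two numbers should agree to within sampling noise.
print(in_both, in_x * in_y)
```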

In fact, if we assume that the errors do not depend on the orientation of the coordinate system, then the probability is a function only of r, the radial distance from the origin, so we can write

Prob = g(r)∆x∆y
g(r)∆x∆y = p(x)∆x p(y)∆y
g(r) = p(x) p(y)

This assumption of rotational invariance will lead us directly to the answer, as you will see. As Hamming says, since g(r) does not depend on the angle θ (but x and y do), we can take the partial derivative of g(r) with respect to θ and set it equal to zero, so that:

0 = ∂g(r)/∂θ = p(x) ∂p(y)/∂θ + p(y) ∂p(x)/∂θ
  = p(x) p'(y) ∂y/∂θ + p(y) p'(x) ∂x/∂θ
  = p(x) p'(y) (r cos θ) + p(y) p'(x) (-r sin θ)

We can parse this. We used the standard product rule: "this times the derivative of that plus that times the derivative of this." We use it to generate the first line (taking the partial derivative of p(x) p(y)). And then, we need to actually find the partial derivatives of p(x) and p(y) with respect to θ, where x = r cos(θ) and y = r sin(θ). For these we use the chain rule, and the fact that the derivative of the sine is just the cosine, while the derivative of the cosine is minus the sine. Thus, for example, the partial derivative of x with respect to θ is -r sin(θ), which is simply -y. Likewise, the partial derivative of y with respect to θ is r cos(θ), which is x.

As stated, this gives:

p(x) p'(y) x - p(y) p'(x) y = 0
p'(x) / (x p(x)) = p'(y) / (y p(y))

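Since x and y are independent variables, the left side depends only on x and the right side only on y, so both sides must equal the same constant K:

p'(x) / (x p(x)) = p'(y) / (y p(y)) = K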
p'(x)/p(x) = Kx

We need a function p(x) whose derivative p'(x) is equal to p(x) times x times a constant. Remember the exponential function from a few days ago?

p(x) = A exp { Kx²/2 }
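The step from p'(x)/p(x) = Kx to this exponential form can be filled in by integrating both sides. A short sketch, where C is a constant of integration (a name introduced here) and A = e^C:

```latex
\begin{aligned}
\frac{p'(x)}{p(x)} &= Kx \\
\frac{d}{dx}\,\ln p(x) &= Kx \\
\ln p(x) &= \frac{K x^{2}}{2} + C \\
p(x) &= e^{C}\, e^{K x^{2}/2} = A \exp\left\{ \frac{K x^{2}}{2} \right\}
\end{aligned}
```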

Since we assume that large errors are less likely than small ones, K < 0, so we can define another constant V = - 1/K and

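p(x) = A exp { -x²/(2V) }

This is the normal distribution with mean 0 and variance V. Requiring the total probability to equal 1 fixes the remaining constant: A = 1/√(2πV).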
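As a numerical sanity check, the sketch below takes the normalizing constant A = 1/√(2πV) (the standard value for a normal density), then confirms that the density integrates to 1, that its variance is V, and that p'(x)/(x p(x)) equals K = -1/V. The value V = 2.5, the test point x = 0.7, and the helper names are arbitrary choices made here.

```python
import math

def p(x, V):
    """Density A * exp(-x^2 / (2V)) with A = 1 / sqrt(2*pi*V)."""
    A = 1.0 / math.sqrt(2.0 * math.pi * V)
    return A * math.exp(-x * x / (2.0 * V))

def integrate(f, a, b, n=20_000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

V = 2.5                    # arbitrary test value for the variance
lim = 12.0 * math.sqrt(V)  # tails beyond 12 standard deviations are negligible

total_prob = integrate(lambda x: p(x, V), -lim, lim)
variance = integrate(lambda x: x * x * p(x, V), -lim, lim)

# Central-difference estimate of p'(x) / (x p(x)), to compare with K = -1/V.
x, h = 0.7, 1e-5
K_est = (p(x + h, V) - p(x - h, V)) / (2 * h) / (x * p(x, V))

print(total_prob, variance, K_est, -1.0 / V)
```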

It is amazing how far we got with this argument! We assumed:

(1) the errors do not depend on the orientation of the coordinate system.
(2) errors in perpendicular directions are independent. This means that being too high doesn't alter the probability of being off to the right.
(3) large errors are less likely than small errors.
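To see that assumption (1) is doing real work, here is a sketch that compares two candidate shapes for p. Both are products p(x) p(y) that fall off with distance, so they are compatible with (2) and (3), but only the Gaussian product is constant around a circle. The double-exponential shape, the radius 1.5, and the function names are choices made here for illustration.

```python
import math

def gaussian(t):
    """Unnormalized Gaussian shape, exp(-t^2 / 2)."""
    return math.exp(-t * t / 2.0)

def laplace(t):
    """Unnormalized double-exponential shape, exp(-|t|)."""
    return math.exp(-abs(t))

def spread_on_circle(p, r, n_angles=360):
    """Max minus min of p(x) * p(y) over points on the circle of radius r."""
    values = [
        p(r * math.cos(theta)) * p(r * math.sin(theta))
        for theta in (2.0 * math.pi * k / n_angles for k in range(n_angles))
    ]
    return max(values) - min(values)

print(spread_on_circle(gaussian, 1.5))  # essentially zero: depends on r only
print(spread_on_circle(laplace, 1.5))   # clearly nonzero: depends on the angle
```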

The pdf from Dan Teague has more. Notice that although we started talking about a probability distribution in two dimensions, the function we end up with is for one dimension.

Even better, James Clerk Maxwell used the same argument in three dimensions to derive his expression for the distribution of molecular velocities in a gas. Here is a very cool simulation that shows the distribution.
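The three-dimensional version is easy to sketch in Python: give each velocity component an independent normal error and look at the resulting speeds. This is only an illustration, not the simulation linked above; the spread sigma = 1 (arbitrary units), the number of molecules, and the seed are assumptions made here.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable
sigma = 1.0             # assumed spread of each velocity component
N = 50_000

# Each velocity component is an independent normal error, as in the dart argument.
speeds = [
    math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(3)))
    for _ in range(N)
]

mean_speed = statistics.fmean(speeds)
mean_sq_speed = statistics.fmean(s * s for s in speeds)

# For the Maxwell speed distribution the mean speed is sigma * sqrt(8 / pi)
# and the mean squared speed is 3 * sigma^2.
print(mean_speed, sigma * math.sqrt(8.0 / math.pi))
print(mean_sq_speed, 3.0 * sigma ** 2)
```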