Python for Bioinformatics: Normal approximation to the binomial

Tuesday, July 28, 2009

Normal approximation to the binomial

I know that the normal distribution can be used as an approximation to the binomial. I was looking for a derivation of this, and I found one via Google in a math forum. Doctor Anthony begins:

Derivation of the Normal distribution from the Binomial distribution
---------------------------------------------------------------------

Let a variate take values 0, k, 2k, 3k, ..., nk  
with probabilities given by successive terms of
(q + p)^n.

What's with the k? Well, we're eventually going to want non-integer values of the variate. The expansion of (q + p)^n is familiar:

q^n + n q^(n-1) p + [n(n-1)/2] q^(n-2) p^2 + ...

The term containing p^i has coefficient C(n,i), so the full term is C(n,i) p^i q^(n-i).

Then the mean m = npk and the variance s^2 = npqk^2

OK. Notice the use of the multiplication rule for variance from the other day: scaling a variate by k multiplies its variance by k^2.
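Here is a small sketch that checks the mean and variance numerically by summing over the terms of the expansion. The values of n, p and k are arbitrary choices for illustration and do not come from the forum post.

```python
from math import comb, isclose

def scaled_binomial_moments(n, p, k):
    """Return (mean, variance) of a variate taking values 0, k, 2k, ..., nk
    with probabilities given by the successive terms of (q + p)^n."""
    q = 1 - p
    # The probability of the value i*k is the term C(n, i) p^i q^(n-i).
    probs = [comb(n, i) * p**i * q**(n - i) for i in range(n + 1)]
    values = [i * k for i in range(n + 1)]
    mean = sum(v * pr for v, pr in zip(values, probs))
    var = sum((v - mean) ** 2 * pr for v, pr in zip(values, probs))
    return mean, var

n, p, k = 20, 0.3, 0.5   # illustrative values only
q = 1 - p
mean, var = scaled_binomial_moments(n, p, k)
print(mean, n * p * k)           # the two numbers should agree
print(var, n * p * q * k**2)     # the two numbers should agree
```

Up to floating-point rounding, the summed values should match npk and npqk^2.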

Suppose:

   y = probability of occurrence of rk = C(n,r) p^r q^(n-r)

Also let:

   y' = probability of occurrence (r+1)k  = C(n,r+1)p^(r+1) q^(n-r-1)

Then:

   y' - y = C(n,r+1)p^(r+1) q^(n-r-1) - C(n,r)p^r q^(n-r)

             n!p^r q^(n-r-1)
          =  ---------------[(n-r)p - (r+1)q]
             (r+1)! (n-r)!

Hmm... I know that

  y  =  [n! / (r! (n-r)!)]        p^r      q^(n-r)
  y' =  [n! / ((r+1)! (n-r-1)!)]  p^(r+1)  q^(n-r-1)
  y' - y = ?

Lucie, you got some factoring to do. Let's deal with q and p first.

The left term has q^(n-r-1) and the right term has q^(n-r), so we can factor out
q^(n-r-1), leaving a factor of q on the right-hand term in the brackets.

Similarly we can factor out p^r from both sides leaving a factor of p on the left.

The combination expressions expand as shown above. We can factor out n! from both sides. We can factor out 1/(r+1)! from both sides, if we first multiply top and bottom of the right-hand term by (r+1), leaving (r+1) on the top.

Similarly, we can factor out 1/(n-r)! from both sides, if we first multiply top and bottom of the left-hand term by (n-r), leaving (n-r) behind on the top. So everything checks out so far. Next, he wants to divide by y:
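The factoring can also be checked numerically. This sketch compares y' - y computed directly from the two binomial terms with the factored form; the function names and the values of n and p are my own choices for illustration.

```python
from math import comb, factorial, isclose

def diff_direct(n, r, p):
    """y' - y computed straight from the two binomial terms."""
    q = 1 - p
    y = comb(n, r) * p**r * q**(n - r)
    y_next = comb(n, r + 1) * p**(r + 1) * q**(n - r - 1)
    return y_next - y

def diff_factored(n, r, p):
    """y' - y using the factored form from the derivation."""
    q = 1 - p
    common = factorial(n) * p**r * q**(n - r - 1) / (factorial(r + 1) * factorial(n - r))
    return common * ((n - r) * p - (r + 1) * q)

n, p = 12, 0.3   # illustrative values only
for r in range(n):   # r + 1 must not exceed n
    a, b = diff_direct(n, r, p), diff_factored(n, r, p)
    print(r, a, b, isclose(a, b, rel_tol=1e-9, abs_tol=1e-12))
```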

And:

   y' - y      1                         1
   ------  = ------[np - r(p+q) - q] = ------[np - r - q]
     y       (r+1)q                    (r+1)q

(Equation 1)

Hmm...again. We're dividing the expression we had above by y.

  y =   [n! / (r! (n-r)!)]       p^r     q^(n-r)

We have:

             n!p^r q^(n-r-1)
  y' - y  =  ---------------[(n-r)p - (r+1)q]
             (r+1)! (n-r)!

So both the n! and (n-r)! factors cancel. The r! cancels against (r+1)!, leaving a factor of (r+1) on the bottom. The p^r cancels, and q^(n-r-1) over q^(n-r) leaves a factor of q on the bottom. So I get:

   y' - y      1
   ------  = ------[(n-r)p - (r+1)q]
     y       (r+1)q

Now we have to figure out how to rearrange the term in brackets:

  [(n-r)p - (r+1)q]

Expand, and then substitute for p + q = 1:

  np - rp - rq - q
  np - r(p+q) - q
  np - r - q

It checks out. Doctor Anthony continues:
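As a sanity check on Equation 1, this sketch computes (y' - y)/y directly from the binomial terms and compares it with the simplified right-hand side. The function names and the values of n and p are assumptions made for the example.

```python
from math import comb, isclose

def relative_step(n, r, p):
    """(y' - y) / y computed directly from adjacent binomial terms."""
    q = 1 - p
    y = comb(n, r) * p**r * q**(n - r)
    y_next = comb(n, r + 1) * p**(r + 1) * q**(n - r - 1)
    return (y_next - y) / y

def equation_1(n, r, p):
    """Right-hand side of Equation 1: (np - r - q) / ((r + 1) q)."""
    q = 1 - p
    return (n * p - r - q) / ((r + 1) * q)

n, p = 12, 0.3   # illustrative values only
for r in range(n):
    lhs, rhs = relative_step(n, r, p), equation_1(n, r, p)
    print(r, lhs, rhs, isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12))
```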

Let:

   x = rk - npk, so that x is now the variate measured 
   from the mean.

Then:

   r = x/k + np      and       r+1 = x/k + np + 1

Thus:

   k(r+1) = x + k + npk

   k^2 (r+1)q = (x + k + npk)qk

So far so good.

Multiply top and bottom of the righthand side of Equation 1 by k^2. 
Then:

   y' - y   [(np-r)k - qk]k       
   ------ = ---------------           [note that (np-r)k = -x]
      y     [x + k + npk]qk

Go back to what we had, and then multiply top and bottom by k²:

   y' - y    [np - r - q] k^2
   ------ =  ----------------
      y      (r+1)q       k^2

Hmm... The top is fine, but on the bottom we had

(r+1) q k²

We need to get to:

[x + k + npk]qk

He says:

[note that (np-r)k = -x]

OK, so we have:

(r+1) q k²
(rk + k) q k

Since:
(np - r) k = -x
rk = npk + x

Substituting:
(x + k + npk) q k

Moving on to substitute for (np-r) k = -x on top and multiplying out on the bottom yields:

              (-x - kq)k
          = ----------------          (Equation 2)
            npqk² + (x+k)qk
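To confirm that the change of variable did not alter anything, this sketch evaluates the last expression above (the one Doctor Anthony goes on to call Equation 2) at x = rk - npk and compares it with Equation 1 at the same r. The function names and parameter values are illustrative assumptions.

```python
from math import isclose

def equation_1(n, r, p):
    """(np - r - q) / ((r + 1) q), written in terms of r."""
    q = 1 - p
    return (n * p - r - q) / ((r + 1) * q)

def equation_2(x, n, p, k):
    """(-x - kq) k / (npqk^2 + (x + k) q k), written in terms of x."""
    q = 1 - p
    return (-x - k * q) * k / (n * p * q * k**2 + (x + k) * q * k)

n, p, k = 12, 0.3, 0.5   # illustrative values only
for r in range(n):
    x = r * k - n * p * k   # the variate measured from the mean
    e1, e2 = equation_1(n, r, p), equation_2(x, n, p, k)
    print(r, x, e1, e2, isclose(e1, e2, rel_tol=1e-9, abs_tol=1e-12))
```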

Finally, we now let k = dx, so that y' - y = dy and 
let n -> infinity in such a way that nk^2 is finite. 
Equation 2 can then be written as:

    dy     (-x - q dx)dx
   ---- = ----------------
    y     s^2 + (x+dx)q dx

The only tricky part here was that we've replaced npqk² by s².
Now he says:

As dx -> 0 this becomes:
 
    dy     -x dx
   ----  = ------
    y       s^2
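The limit can be watched numerically. In this sketch I hold nk^2 = 1 (so s^2 = pq), pick the integer r whose x is closest to a target value, and compare the exact (y' - y)/y with -x dx / s^2 as n grows. The target x, the choice nk^2 = 1 and the function names are all assumptions made for the example.

```python
from math import sqrt

def exact_ratio(n, r, p):
    """(y' - y) / y from the ratio of adjacent binomial terms,
    y'/y = (n - r) p / ((r + 1) q), which avoids huge factorials."""
    q = 1 - p
    return (n - r) * p / ((r + 1) * q) - 1

def limiting_ratio(x, dx, s2):
    """The limiting form -x dx / s^2."""
    return -x * dx / s2

p = 0.3
q = 1 - p
s2 = p * q        # with n k^2 = 1, s^2 = n p q k^2 = p q
target_x = 0.5    # illustrative distance from the mean

ratios = []
for n in (100, 10_000, 1_000_000):
    k = 1 / sqrt(n)                    # k plays the role of dx
    r = round(n * p + target_x / k)    # nearest integer count
    x = (r - n * p) * k
    exact = exact_ratio(n, r, p)
    approx = limiting_ratio(x, k, s2)
    ratios.append(exact / approx)
    print(n, x, exact, approx, exact / approx)
```

The last column should move toward 1 as n increases, which is what dropping the q dx and (x+dx)q dx terms amounts to.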

And we're there! If we integrate the left side we get ln(y), and the right side gives
-x² / (2s²) plus a constant of integration. Exponentiating, with the constant absorbed into A:

y = A exp { -x² / (2s²) }
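To see how close the result is in practice, here is a sketch comparing the binomial terms with the curve A exp(-x²/(2s²)). The derivation leaves A undetermined; the value A = k / (s sqrt(2 pi)) used below is my assumption, chosen so that the curve sampled at spacing k sums to roughly 1. The values of n, p and k are also illustrative.

```python
from math import comb, exp, pi, sqrt

def binomial_term(n, r, p):
    """y = C(n, r) p^r q^(n-r), the probability of the value r*k."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def normal_curve(x, s, k):
    """A exp(-x^2 / (2 s^2)) with A = k / (s sqrt(2 pi)).

    This choice of A is an assumption added here, not part of the
    derivation: it normalizes the curve when sampled at spacing k.
    """
    A = k / (s * sqrt(2 * pi))
    return A * exp(-x**2 / (2 * s**2))

n, p, k = 1000, 0.3, 0.5   # illustrative values only
q = 1 - p
s = sqrt(n * p * q) * k    # s^2 = n p q k^2
# Look at counts r within about 4 standard deviations of the mean.
lo, hi = int(n * p - 4 * s / k), int(n * p + 4 * s / k)

max_diff = 0.0
for r in range(lo, hi + 1):
    x = r * k - n * p * k  # the variate measured from the mean
    diff = abs(binomial_term(n, r, p) - normal_curve(x, s, k))
    max_diff = max(max_diff, diff)

peak = binomial_term(n, round(n * p), p)
print(f"peak term: {peak:.5f}, largest difference: {max_diff:.6f}")
```

The largest difference should be small compared with the peak term, and it should shrink further as n is increased.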