Python for Bioinformatics: Normal distribution construction in Python

Tuesday, November 9, 2010

Normal distribution construction in Python

Here is an example with the normal distribution that will seem trivial after the t-distribution (here).

The basic form of the normal is exp {-x²/2}. We define that as a Python function f(x), vectorize it, and construct an array X of discrete points from -10 to +10 with interval dx = 0.001. We apply the vectorized function to the array to get the relative densities. So that we obtain the correct area under the curve, we multiply the height (value of f(x)) of each piece by its width, dx. When we sum up all the pieces, the total is equal to √2π as seen in the printout. We divide by this value to normalize the distribution so that its total area is equal to 1 and it becomes a pdf.

The form of the normal that includes a term for the standard deviation is normal(x). Everything is as before, except we substitute z for x, and at the end we find that the normalizing constant is 1/σ√2π. We plot it to have something pretty to look at.

[UPDATE:
In my discussions of probability distributions of late I've played a little fast and loose with terminology. The pdf (probability density function) has a value for any x. At x = μ the exponential term is equal to 1, and the value is 1/σ√2π. For the standard normal, this is about 0.4.

>>> 1.0/sqrt(2*pi)
0.3989422804014327

Of course, this is somewhat misleading, since the probability that x = any particular point approaches zero, because x is a real number with infinitely many values. The way we actually use the pdf is to ask what is the probability for a particular window, a range of values for a < x < b, and at least conceptually, this is done by integrating the function between these limits. That's where the discrete version that I developed comes in handy. We evaluate p(x) for a small enough interval and then multiply by the interval size to get a probability for that slice. To integrate between limits, just add the included slices. But for this to work, we have to "normalize" the pdf so that the total of all the little rectangles of width dx is equal to 1. It also helps to have the discrete version in developing the cdf (cumulative distribution function), since we can just accumulate the pdf as we move along the values of X generating the cdf as we go.

Technically, the pdf(x) is the slope of the cdf(x), which gets around the issue mentioned above for a continuous function. ]

output:

2.507 2.507
5.013 5.013

code listing:

from __future__ import division
import numpy as np
import math
import matplotlib.pyplot as plt

@np.vectorize
def f(x):
return math.e**(-0.5*(x**2))

@np.vectorize
def normal(x,mu=0,sigma=1):
z = (x-mu)/sigma
return math.e**(-0.5*(z**2))
#==================================
dx = 0.001
X = np.arange(-10,10+dx,dx)
pdf = f(X)
pdf *= dx
print round(sum(pdf),3),
S = math.sqrt(2*math.pi)
print round(S,3),
pdf /= S

sigma = 2
pdfn = normal(X,0,sigma)
pdfn *= dx
print round(sum(pdfn),3),
Sn = math.sqrt(2*math.pi) * sigma
print round(Sn,3),
pdfn /= Sn
#==================================
plt.plot(X,pdf,color='r',lw=4)
plt.plot(X,pdfn,color='k',lw=4)
ax = plt.axes()
ax.set_xlim(-6,6)
m = max(pdf)
ymin,ymax = -m/100,m*1.1
ax.set_ylim(ymin,ymax)
plt.text(3,0.75*ymax,s='$\sigma = 1$',
color='r',fontsize=24)
plt.text(3,0.65*ymax,s='$\sigma = 2$',
color='k',fontsize=24)
plt.savefig('example.png')