Monday, March 15, 2010

Sampling a distribution

I was listening to an introductory talk about MCMC (link). I'll have more to say about the talk another time (hint: I think it's terrific). The speaker described a method for sampling from a probability distribution. The idea is to generate random samples from U[0,1], and then find the value of x at which the cumulative density function first exceeds the random number. The resulting samples of x are plotted as a histogram.

The Python script has three parts. In the first section we set up a weighted mixture of three normal distributions of different mean and sd, and plot that in red using x increments of 0.01. In the second section the same values are used to generate a discrete cdf for the same points. This is plotted in blue (after normalizing by the interval size).

Then we implement the algorithm for sampling. Binning for the histogram used the collections.Counter class that will be in Python 2.7. I don't have that, so I grabbed the class from here.

You might notice that the labels on the y-axis also cover the histogram. It would be better to have two separate plots, but I'm not quite sure how to handle that yet.

[UPDATE: more here]

import math, sys
import numpy as np
import matplotlib.pyplot as plt
import Counter

def normal(mu,sigma):
def f(x):
z = 1.0*(x-mu)/sigma
e = math.e**(-0.5*z**2)
C = math.sqrt(2*math.pi)*sigma
return 1.0*e/C
return f

p1 = normal(0,2)
p2 = normal(10,1)
p3 = normal(18,0.5)

# sum of weighted normal distributions
def p(x):
return 0.5*p1(x) + 0.25*p2(x) + 0.25*p3(x)

dx = 0.01
xmax = 25
R = np.arange(-10,xmax+dx,dx)
# dashed lines
L = p(R)
cdf = [L[0]]
for e in L[1:]:
cdf.append(cdf[-1] + e)
cdf = np.array(cdf)
cdf *= dx

ax = plt.axes()
def find_first(n,L):
for i,e in enumerate(L):
if n < e: return i
return len(L)

samples = list()
width = 0.4
f = 1/width
for i in range(1000):
n = np.random.random()
# must adjust to actual range
value = find_first(n,cdf)*dx - 10.0
# trick to truncate at fractional values

#samples = np.array(samples)
c = Counter.Counter(samples)
maxn = c.most_common(1)[0][1]
for k in c:
n = c[k]
n = n * 0.45 / maxn
r = plt.Rectangle((k,-0.5),


writeonly said...

One small (potential) improvement to "find_first" would be to make it a binary
search, taking it from O(categories) to O(log categories).

daedrobu87 said...

Hi. I found this code very useful. Another improvement could be to use the numpy.where() function instead of creating an ennumerate and find the first value greater than certain value. For example, to find the values in a function that are greater than a low boundary:

Using built in function could bring a faster performance.