Sunday, January 24, 2010

The Birthday Problem, revisited

I ran into the Birthday Problem in another context, and didn't recognize it. It's pretty embarassing. (I've even blogged about the issue!). Here's the question on Stack Overflow.

Problem description: take a list of 10 elements and run random.shuffle on them in Python. There are a total of 10! permutations. Shuffle 5000 times, recording a copy of the list each time.

>>> import random
>>> random.seed(137)
>>> L = range(10)
>>> N = 5000
>>> rL = list()
>>>
>>> for i in range(N):
... random.shuffle(L)
... rL.append(L[:])
...
>>>

Then to be safe, check for duplicates among the results. Use set to check for dups, but it only works if the items are immutable (so substitute tuples for lists). There are 3 duplicates!

>>> rL = [tuple(e) for e in rL]
>>> len(set(rL))
4997
>>> tL = list()
>>> for i,t in enumerate(rL):
... if t in tL:
... print rL.index(t), i, t
... tL.append(t)
...
290 3354 (9, 3, 1, 0, 6, 2, 5, 4, 8, 7)
3486 3807 (6, 4, 1, 5, 2, 0, 8, 7, 9, 3)
435 4265 (1, 4, 7, 8, 6, 9, 5, 0, 3, 2)

The number of permutations of 10 elements is much larger than 5000.

>>> 10*9*8*7*6*5*4*3*2*1
3628800

Confirm this by using another method in the random module:

>>> import random
>>> random.seed(137)
>>> L = list()
>>> N = 5000
>>> for i in range(N):
... L.append(random.choice(xrange(3628800)))
...
>>> len(set(L))
4995

Get a rough estimate of how likely this is:

>>> import random
>>> random.seed(1357)
>>> def oneTrial():
... L = list()
... for i in range(5000):
... L.append(random.choice(xrange(3628800)))
... return len(set(L)) == len(L)
...
>>> N = 1000
>>> L = [oneTrial() for i in range(N)]
>>> p = 1.0 * sum(L) / N
>>> print round(p,2)
0.04

Analysis:

One approach is to say that the probability that any two permutations are not the same is:

p = 1.0*3628799/3628800

Very large, but definitely not certainty (p = 1). The number of combinations of 2 elements drawn from 5000 is also large.

5000! / 2! * 4998! 
>>> C = 5000 * 4999 / 2
>>> C
12497500

The probability that at least l combination is the same (shares a birthday) is:


>>> 1 - p**C
0.96806256495611798

The other way is to consider a group of two (lists of 10 elements). The probability that they are not the same is (again)

p = 1.0*3628799/3628800

Now, consider a third list. There are only 3628798/3628800 orders available that would not be a duplication, so the total probability for all three is the product of

p = (1.0*3628799/3628800) * (1.0*3628798/3628800)

And the probability of at least one dup is 1 - p. We extend this for a total of N-2 steps:

>>> N = 5000
>>> num = 3628799
>>> den = 3628800
>>> p = 1.0*num/den
>>> for i in range(N-2):
... num -= 1
... p *= 1.0*num/den
...
>>> 1-p
0.96811305757470723

Close enough.

No comments: