## Sunday, January 24, 2010

### The Birthday Problem, revisited

I ran into the Birthday Problem in another context, and didn't recognize it. It's pretty embarassing. (I've even blogged about the issue!). Here's the question on Stack Overflow.

Problem description: take a list of 10 elements and run `random.shuffle` on them in Python. There are a total of 10! permutations. Shuffle 5000 times, recording a copy of the list each time.

 `>>> import random>>> random.seed(137)>>> L = range(10)>>> N = 5000>>> rL = list()>>> >>> for i in range(N):... random.shuffle(L)... rL.append(L[:])... >>>`

Then to be safe, check for duplicates among the results. Use set to check for dups, but it only works if the items are immutable (so substitute tuples for lists). There are 3 duplicates!

 `>>> rL = [tuple(e) for e in rL]>>> len(set(rL))4997>>> tL = list()>>> for i,t in enumerate(rL):... if t in tL:... print rL.index(t), i, t... tL.append(t)... 290 3354 (9, 3, 1, 0, 6, 2, 5, 4, 8, 7)3486 3807 (6, 4, 1, 5, 2, 0, 8, 7, 9, 3)435 4265 (1, 4, 7, 8, 6, 9, 5, 0, 3, 2)`

The number of permutations of 10 elements is much larger than 5000.

 `>>> 10*9*8*7*6*5*4*3*2*13628800`

Confirm this by using another method in the random module:

 `>>> import random>>> random.seed(137)>>> L = list()>>> N = 5000>>> for i in range(N):... L.append(random.choice(xrange(3628800)))... >>> len(set(L))4995`

Get a rough estimate of how likely this is:

 `>>> import random>>> random.seed(1357)>>> def oneTrial():... L = list()... for i in range(5000):... L.append(random.choice(xrange(3628800)))... return len(set(L)) == len(L)... >>> N = 1000>>> L = [oneTrial() for i in range(N)]>>> p = 1.0 * sum(L) / N>>> print round(p,2)0.04`

Analysis:

One approach is to say that the probability that any two permutations are not the same is:

`p = 1.0*3628799/3628800`

Very large, but definitely not certainty (p = 1). The number of combinations of 2 elements drawn from 5000 is also large.

 `5000! / 2! * 4998! >>> C = 5000 * 4999 / 2>>> C12497500`

The probability that at least l combination is the same (shares a birthday) is:

 `>>> 1 - p**C0.96806256495611798`

The other way is to consider a group of two (lists of 10 elements). The probability that they are not the same is (again)

`p = 1.0*3628799/3628800`

Now, consider a third list. There are only 3628798/3628800 orders available that would not be a duplication, so the total probability for all three is the product of

`p = (1.0*3628799/3628800) * (1.0*3628798/3628800)`

And the probability of at least one dup is 1 - p. We extend this for a total of N-2 steps:

 `>>> N = 5000>>> num = 3628799>>> den = 3628800>>> p = 1.0*num/den>>> for i in range(N-2):... num -= 1... p *= 1.0*num/den... >>> 1-p0.96811305757470723`

Close enough.