Tuesday, August 4, 2009

The birthday problem

Everyone knows somebody who has the same birthday as they do. And if not, you can go to wikipedia, like I did, to find that both Anton van Leeuwenhoek and Kevin Kline were also born on October 24. In this post, I want to explore this famous problem. If we ignore Feb 29 and assume that births are evenly distributed over the days of the year (which may not be true), then the probability that two individuals chosen at random share the same birthday is 1/365.

So let's think about this famous gathering, and ask the question: if the individual chance that Albert and Max share the same birthday is 1/365, what is the probability that there are at least two people in this group that do share the same special day?

The key, as you probably guess, is that this is a combinations problem. Consider a group of 5 individuals (A-E), if F walks up to the group and introduces herself, there are 5 introductions to make.

If we think about building up a group in steps, then for n individuals the number of introductions that have been made is:

 `Σ 1 + 2 ... + n-1`

This is the problem supposedly solved by Gauss as a young boy.

Another way to think about it is to consider that if each person in the group of n people shakes hands with every other person, there are n * (n-1) hands involved in handshakes, but then we must divide by two to get the unique interactions, since we've counted one hand for both of the interacting partners.

In general, we have the formula for combinations: C(n,k) = n! / (n-k)! k!, where k = 2.

We can solve the birthday problem in a couple of ways. We may say that we have the probability for each pair that they do not share a birthday, which is 364/365. The probability that all the independent combinations do not share a birthday is (364/365)**C(n,2). The probability of the complementary event, that at least one pair does share a birthday, is 1 - (364/365)**C(n,2).

The second approach is to consider the group with 2 people and P = 364/365 that they do not share a birthday. If a new person walks up to the group, there are 363 birthdays which would preserve the "no shared birthday" criterion. The probability of the desired event is then 1 - 364/365 * 363/365..., extended for n-1 steps.

This is easy to program, and I won't bore you with the details, but I will show this pretty plot I made using R:

Now, to the point of the post. I thought about finding a group near the critical size and testing it for the birthday criterion. I'm not such a big sports fan anymore, unless you consider politics a sport. How about the Presidents of the United States? Barack Obama is #44. You can get their vitals from wikipedia, but I found a text version on the web here.

The data needs just a bit of cleanup. One date lacks the comma, one date is listed as April 28th. And one entry has two tabs separating the name and the birthday. And why, exactly, are we presented with Obama's middle name? ("I got my middle name from someone who obviously, never thought I'd be running for President")-video.

Here are the results:

 `James K. Polk November 2Warren G. Harding November 2Millard Fillmore January 7Richard M. Nixon January 9William McKinley January 29Franklin D. Roosevelt January 30Ronald Reagan February 6William Henry Harrison February 9Abraham Lincoln February 12George Washington February 22Andrew Jackson March 15James Madison March 16Grover Cleveland March 18John Tyler March 29Thomas Jefferson April 13James Buchanan April 23Ulysses S. Grant April 27James Monroe April 28Harry S Truman May 8John Kennedy May 29George H. W. Bush June 12Calvin Coolidge July 4George W. Bush July 6John Quincy Adams July 11Gerald R. Ford July 14Barack Hussein Obama August 4Herbert Hoover August 10William J. Clinton August 19Benjamin Harrison August 20Lyndon B. Johnson August 27William Howard Taft September 15Jimmy Carter October 1Rutherford B. Hayes October 4Chester A. Arthur October 5Dwight D. Eisenhower October 14Theodore Roosevelt October 27John Adams October 30Warren G. Harding November 2James K. Polk November 2James A. Garfield November 19Franklin Pierce November 23Zachary Taylor November 24Martin Van Buren December 5Woodrow Wilson December 28Andrew Johnson December 29`

And here is the code:

 `fn = 'presidents.txt'FH = open(fn,'r')data = FH.read().strip()FH.close()L = data.split('\n')def parseDate(s): s = s.strip() y = int(s.split()[-1]) s = s.split(',')[0] m,d = s.split() return {'year':y,'month':m, 'day':int(d) } D = dict()for e in L: t = e.split('\t') name,date = t[0],t[-1] # extra tabs in some D[name] = parseDate(date) for i,k1 in enumerate(D.keys()): for j in range(i): k2 = D.keys()[j] if D[k1]['month'] == D[k2]['month']: if D[k1]['day'] == D[k2]['day']: print k1.ljust(25), print D[k1]['month'],D[k1]['day'] print k2.ljust(25), print D[k2]['month'],D[k2]['day']printmL = ['January','February','March','April', 'May','June','July','August','September', 'October','November','December']def f(k): return mL.index(D[k]['month']),D[k]['day'] for k in sorted(D.keys(),key=f): print k.ljust(25), print D[k]['month'],D[k]['day']`