Python for Bioinformatics: Simpson's paradox

Tuesday, July 21, 2009

Simpson's paradox

In Dennis Lindley's book, Understanding Uncertainty, I came across a striking statistical paradox I had never heard of, but which seems to be well known to statisticians of all ages. It's called Simpson's paradox. The Wikipedia entry has such a good example that I will shamelessly appropriate it:

Here are the batting averages (hits/at-bats in parentheses) for two Major League Baseball players in consecutive seasons in the 1990s:

          1995              1996
A    .250  (12/48)     .314 (183/582)
B    .253 (104/411)    .321  (45/140)

           Combined
A    .310 (195/630)
B    .270 (149/551)

Player A's average was much higher considering the two seasons together, but in each individual year, player B had better numbers.
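The arithmetic is easy to check in Python. This is just a minimal sketch using the numbers from the tables above; the names `records`, `average` and `combined` are my own and not from any library.

```python
# (hits, at_bats) per season, copied from the tables above
records = {
    "A": {1995: (12, 48), 1996: (183, 582)},
    "B": {1995: (104, 411), 1996: (45, 140)},
}

def average(hits, at_bats):
    """Batting average as a plain float."""
    return hits / at_bats

def combined(player):
    """Sum hits and at-bats over all seasons for one player."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(n for _, n in records[player].values())
    return hits, at_bats

# Season by season, B comes out ahead
for year in (1995, 1996):
    a = average(*records["A"][year])
    b = average(*records["B"][year])
    print(f"{year}: A={a:.3f}  B={b:.3f}  B ahead: {b > a}")

# Pooled over both seasons, A comes out ahead
a_all = average(*combined("A"))
b_all = average(*combined("B"))
print(f"both: A={a_all:.3f}  B={b_all:.3f}  A ahead: {a_all > b_all}")
```

Note that the combined average is computed from the pooled hits and at-bats, not by averaging the two season averages.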

The key to understanding how this works is to notice that the number of at-bats varies markedly between years: player A had just 48 in 1995 and player B only 140 in 1996. Each player's combined average is therefore dominated by his busier season (1996 for A, 1995 for B), so the two combined totals weight the individual years very differently.
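One way to see the effect of the unequal at-bats is to write each combined average as a weighted mean of the season averages, where the weight is that season's share of the player's at-bats. Here is a rough sketch; `weighted_view` is a helper name I made up for illustration.

```python
# (hits, at_bats) for 1995 and 1996, in that order
seasons = {
    "A": [(12, 48), (183, 582)],
    "B": [(104, 411), (45, 140)],
}

def weighted_view(rows):
    """Return [(season average, weight), ...] and the combined average."""
    total = sum(n for _, n in rows)
    parts = [(h / n, n / total) for h, n in rows]
    combined = sum(avg * w for avg, w in parts)
    return parts, combined

for player, rows in seasons.items():
    parts, combined = weighted_view(rows)
    for year, (avg, w) in zip((1995, 1996), parts):
        print(f"{player} {year}: average {avg:.3f}, weight {w:.2f}")
    print(f"{player} combined: {combined:.3f}")
```

Player A's weak 1995 season gets a weight of less than a tenth, while player B's weak 1995 season gets about three quarters, which is why the ordering flips in the combined totals.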

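Remarkably, the pattern persists through a third season: player B again has the better average for the year, yet player A still leads over all three seasons combined, although the combined totals are now very close: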

          1997          Combined (1995-1997)
A    .291 (190/654)    .300  (385/1284)
B    .329 (163/495)    .298  (312/1046)

Player A:  Derek Jeter
Player B:  David Justice

In a medical situation, one could get a similar result with A and B being different treatments, and the years corresponding to some "lurking variable" (sex is a common example). The moral of the story: if you wish to break down results by some factor, you need to have similar sample sizes from the two groups.

Here is another striking example involving vectors, which I generated using R code from Wikipedia.

Individual trials 1 and 2 show that red has a higher "success rate", but when the two trials are combined, blue is better. There are other great examples in the article.
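The vector picture can also be tried out on the batting data from earlier in the post (this is not the red/blue data from the plot). As a sketch, assume each season is a vector (at-bats, hits), so that its slope is the batting average and combining seasons means adding vectors. The helper names `add` and `slope` are mine.

```python
def add(u, v):
    """Component-wise sum of two 2-D vectors."""
    return (u[0] + v[0], u[1] + v[1])

def slope(v):
    """Slope of a vector (at_bats, hits), i.e. the batting average."""
    return v[1] / v[0]

# Each season as a vector (at_bats, hits)
a_1995, a_1996 = (48, 12), (582, 183)
b_1995, b_1996 = (411, 104), (140, 45)

# B's vector is steeper than A's in each individual season...
print(slope(b_1995) > slope(a_1995), slope(b_1996) > slope(a_1996))

# ...but the sum of A's vectors is steeper than the sum of B's
a_sum = add(a_1995, a_1996)
b_sum = add(b_1995, b_1996)
print(a_sum, b_sum, slope(a_sum) > slope(b_sum))
```

The slope of a sum is pulled toward the slope of the longer vector, so a short steep vector added to a long shallow one gives a shallow result.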