Python for Bioinformatics: Bayes 7: binomial proportion example

Sunday, June 14, 2009

Bayes 7: binomial proportion example

In this post, we'll work through a simple binomial proportion problem using a Bayesian approach, following Bolstad, compare it with the classical frequentist method, and then check our results against those obtained with the R functions provided in the Bolstad package.

Suppose we want to estimate the proportion of a population who plan to vote for the incumbent in the next election. We take a random sample of n = 200 individuals from the population and observe:

y = 68 (individuals answering "yes")
n-y = 132 (individuals answering "no")
n = 200

The distribution of y is binomial(n=200,p).

We have no prior information, so we should probably choose a beta(1,1) distribution for the prior. We use the updating rule to generate the posterior probability density for p, it is:

beta(1+y,1+n-y)
beta(69,133)

We calculate the mean and variance of the distribution:

a = 69
b = 133:
m = a / a+b
= 69 / 202 = 0.3416

Var = a*b / ((a+b)**2 * (a+b+1))
= 69*133 / ((202)**2 * (203))
= 9177 / (40804 * 203)
= 0.0011

And the standard deviation is:

s = sqrt(Var)
= 0.0333

Since a and b are large enough (roughly, > 10) we can approximate this beta distribution by a normal distribution with the same mean and variance. Then, we calculate the estimated 95 % credible interval simply as

95 % CI = m -/+ 1.96 * s
= 0.3416 -/+ 1.96 * 0.0333
= 0.3416 -/+ 0.0653
= (0.2763, 0.4069)

Applying these values to a total n = 200, we find:
m = 68.3
s = 6.6
95 % credible interval (CI) = (55.26, 81.38)

The standard frequentist analysis for the same problem would give a maximum likelihood estimate for p as:

p = y / n = 0.34

mean = n*p = 68

Var = n * p * (1-p)
= 200 * 0.34 * 0.66
= 44.88

s = 6.7

We can approximate this binomial distribution as a normal distribution with the same mean and variance. We calculate the 95 % confidence interval as:

95 % confidence interval (CI) = m -/+ 1.96 * s
= 68 -/+ 1.96 * 6.7
= 68 -/+ 13.132
= (54.9, 81.1)

Which is not much different than with the non-informative prior. In fact, according to my understanding, they should be the same, so perhaps I am doing something wrong.

The difference between the two approaches is that with the Bayesian analysis we are justified in saying that the probability is 95 % that the true mean lies within the 95 % credible interval. Actually, we can use the posterior distribution to calculate the probability that it lies in any particular interval.

In contrast, the frequentist perspective is more subtle and less general. Given a pre-determined p value, the CI either contains the mean or it does not, with significance at the level = p.

Here is the R code using the Bolstad package:

library(Bolstad)
binobp(68,200,1,1)

Posterior Mean           :  0.3415842 
Posterior Variance       :  0.0011079 
Posterior Std. Deviation :  0.0332852 

Prob. Quantile 
------ ---------
0.005 0.2591665
0.01 0.2666906
0.025 0.2779134
0.05 0.287724
0.5 0.3410604
0.95 0.3972323
0.975 0.4082264
0.99 0.4210788
0.995 0.4298666

One great thing about this package is that we can easily use other priors. For example we could imagine a discrete, trapizoidal-shaped prior with a flat top and corners at: 0.1,0.2,0.6,0.8.

library(Bolstad)
pi = seq(0,1,length=11)
pi.prior = c(rep(0,2),rep(0.1,5),rep(0,4))
binodp(68,200,pi=pi,pi.prior=pi.prior)