Python for Bioinformatics: Confidence interval for sample mean

Wednesday, August 5, 2009

Confidence interval for sample mean

"The long run is a misleading guide to current affairs. In the long run we are all dead."---Keynes

"The standard error of the mean (SEM) is the standard deviation of the sample mean estimate of a population mean...If the data are assumed to be normally distributed, quantiles of the normal distribution and the sample mean and standard error can be used to calculate approximate confidence intervals for the mean." (wikipedia)

The Central limit theorem says that if we take samples of sufficient size from a distribution, and then calculate the sample mean x, and the sample standard deviation s, and then divide by √n to get the standard error of the means (sem), x is normally distributed with standard deviation = sem. Thus, the sample mean x -/+ 1.96 * sem will include the true mean at least 95% of the time. However this does require n to be "sufficiently large."

Here is a simulation. It makes a very striking figure. We will actually draw our samples from the normal distribution, but it should work even if we don't. We draw 50 samples of size N from Normal(50,10²), plot the sample mean as a circle, and draw a line out to -/+ 2 * sem. The dashed lines indicate the population mean μ and one population standard deviation σ.

N = 4

So you can see that in the 50 replicates we have---I count 7---trials in which the sample mean -/+ 2 * sem did not include the true mean. This is not unusual for N=4, as you will see if you do it yourself a number of times. We will do better with larger samples.

[Update: Somehow I forgot to mention the t distribution here! Gosset discovered a correction which helps for small N. If N = 4, we're supposed to multiply by 2.57 rather than by 1.96. If N < 10, it makes a real difference. Oops.]

Here is the R code. We vary N to get the different plots.

N = 4 # numbers per group
G = 50  # groups
M = 50  # mean
S = 10  # stdev
v = numeric()  # means
t = numeric()  # top
b = numeric()  # bottom
for (n in 1:G) {
  w = rnorm(N,M,S)
  m = mean(w)
  s = sd(w)
  sem = s / sqrt(N)
  v[n] = m
  t[n] = m + 2*sem
  b[n] = m - 2*sem }

We do much better with this one. There is only a single sample where the sample mean -/+ 2 * sem lies outside the true population mean -/+ σ:

N = 25

Here is the code for the plots. I adjusted ylim for some of them. And I made the initial plot with type='n' (no plot), so that I could add the points after the vertical lines. (I wanted them to lie on top).

plot(1:G,v,
  xlab='replicate',
  ylim=c(25,75),type='n')
for (i in 1:G){
  lines(c(i,i),
    c(t[i],b[i]),lwd=2) }
points(1:G,v,pch=16,
  cex=2,col='red')

X = 10
lines(c(0,G),c(50,50),lty=2)
lines(c(0,G),c(50+X,50+X),lty=2)
lines(c(0,G),c(50-X,50-X),lty=2)

And here are some plots for other values of N:

N = 2 is just crazy. We include the true mean a little over half the time.

N = 2

N = 9

N = 16

N = 36