## Tuesday, August 4, 2009

### Sample variance

I'm still struggling to improve my grasp of statistics. I have Ewens & Grant, but in places it seems to be beyond me. The question I'm after today is why, when we compute the sample variance, we divide by n-1 rather than n. There is an explanation in the book, which I'll try to work through in a future post. Here is a bit of R code that simulates a number of samples from a normal distribution and shows that n-1 is the right choice. First, we write our own function to compute the sum of squared differences from the mean, and then two functions that use that result to calculate the population variance and the sample variance.

```r
sumsq <- function(x) {
    S = sum(x**2)
    L = length(x)
    m = mean(x)
    return (S - (L*(m**2)))
}

var.sample <- function(x) {
    return (sumsq(x) / (length(x)-1))
}

var.population <- function(x) {
    return (sumsq(x) / length(x))
}
```

(On editing this, I realize that I should have computed the number of items, length(x), once in the var functions and passed it into sumsq, rather than computing it in both places.)
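For readers outside R, here is a minimal Python sketch of the same idea (not from the original post), using the identity Σ(x−m)² = Σx² − n·m²; the stdlib `statistics` module's `pvariance` and `variance` are the population (divide by n) and sample (divide by n−1) versions.

```python
import statistics

def sumsq(x):
    """Sum of squared deviations from the mean, via sum(x^2) - n*m^2."""
    n = len(x)
    m = sum(x) / n
    return sum(v * v for v in x) - n * m * m

def var_sample(x):
    # divide by n - 1 (unbiased sample variance)
    return sumsq(x) / (len(x) - 1)

def var_population(x):
    # divide by n (population variance)
    return sumsq(x) / len(x)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var_population(x))  # agrees with statistics.pvariance(x)
print(var_sample(x))      # agrees with statistics.variance(x)
```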

We can see that the default var function in R uses the sample method:

```r
set.seed(157)
x = rnorm(50)
var(x)
var.sample(x)
var.population(x)
```

```
> var(x)
[1] 1.253426
> var.sample(x)
[1] 1.253426
> var.population(x)
[1] 1.228358
```

Let's grab some numbers from rnorm and organize them as a matrix. Since we haven't specified a mean and standard deviation, they will get the default values μ = 0 and σ = 1.

```r
x = rnorm(20000)
m = matrix(x,ncol=10,byrow=T)
```

Now we use the built-in function "apply"; the second argument (=1) tells it to apply the indicated function to m by rows.

```r
w = apply(m,1,var.population)
mean(w)
w = apply(m,1,var.sample)
mean(w)
```

```
> w = apply(m,1,var.population)
> mean(w)
[1] 0.9034893
> w = apply(m,1,var.sample)
> mean(w)
[1] 1.003877
```
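The same experiment translates to Python with the stdlib `random` and `statistics` modules (a sketch, not the original R run, so the exact numbers differ): dividing by n should again come out low, near (n−1)/n · σ² = 0.9, while n−1 hovers near σ² = 1.

```python
import random
import statistics

random.seed(157)
# 2000 rows of 10 draws each from a standard normal
rows = [[random.gauss(0, 1) for _ in range(10)] for _ in range(2000)]

# average per-row variance under each convention
mean_pop = statistics.mean(statistics.pvariance(r) for r in rows)   # divide by n
mean_samp = statistics.mean(statistics.variance(r) for r in rows)   # divide by n-1

print(mean_pop)   # should be close to 0.9
print(mean_samp)  # should be close to 1.0
```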

```r
par(mfrow=c(1,2))
# recompute each set so the two panels show different data
# (the original overwrote w and plotted the same vector twice)
hist(apply(m,1,var.population),col='blue',freq=F)
hist(apply(m,1,var.sample),col='magenta',freq=F)
```

No question about it. Dividing by n underestimates the population variance, while n-1 looks good. It provides, as they say, an unbiased estimator. I did not appreciate that the sample variances have a strong positive skew. That's interesting. In fact, for normal samples (n-1)s²/σ² follows a chi-squared distribution with n-1 degrees of freedom, which is a special case of the gamma distribution.
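The unbiasedness can also be checked directly (a standard derivation, not in the original post). Writing the sum of squares as ΣXᵢ² − nX̄² and using E[Xᵢ²] = σ² + μ² and E[X̄²] = σ²/n + μ²:

```latex
\begin{aligned}
E\Big[\sum_{i=1}^n (X_i - \bar{X})^2\Big]
  &= \sum_{i=1}^n E[X_i^2] - n\,E[\bar{X}^2] \\
  &= n(\sigma^2 + \mu^2) - n\Big(\frac{\sigma^2}{n} + \mu^2\Big) \\
  &= (n-1)\,\sigma^2 .
\end{aligned}
```

So dividing the sum of squares by n-1 gives an estimator with expectation exactly σ², while dividing by n gives (n-1)/n · σ² = 0.9 σ² for n = 10, which matches the simulation above.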

It also works for the uniform distribution:

```r
x = runif(20000)
m = matrix(x,ncol=10,byrow=T)
w = apply(m,1,var.population)
mean(w)
w = apply(m,1,var.sample)
mean(w)
var(x)
par(mfrow=c(1,2))
# recompute each set so the two panels show different data
hist(apply(m,1,var.population),col='blue',freq=F)
hist(apply(m,1,var.sample),col='magenta',freq=F)
```

```
> w = apply(m,1,var.population)
> mean(w)
[1] 0.07463633
> w = apply(m,1,var.sample)
> mean(w)
[1] 0.08292925
> var(x)
[1] 0.08251885
```
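As a sanity check on those numbers (a Python sketch, not from the original post): the variance of U(0,1) is 1/12 ≈ 0.0833, which is what the n−1 estimator hovers around, while dividing by n lands near (9/10)(1/12) = 0.075 — just as in the R output.

```python
import random
import statistics

random.seed(1)
# 2000 rows of 10 draws each from Uniform(0, 1)
rows = [[random.random() for _ in range(10)] for _ in range(2000)]

mean_pop = statistics.mean(statistics.pvariance(r) for r in rows)   # divide by n
mean_samp = statistics.mean(statistics.variance(r) for r in rows)   # divide by n-1

print(mean_samp)  # should be close to 1/12 ~ 0.0833
print(mean_pop)   # should be close to 0.9/12 = 0.075
```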