Python for Bioinformatics: Correction term for the mean

Saturday, November 13, 2010

Correction term for the mean

A few days ago I asked a question on Stack Exchange about the "correction term for the mean" and two different ways of calculating the sum of squares for variance (original post here).

The answer, as formulated by Srikant Vadali, is developed elegantly in the first link, but to recap the math here: it's pretty easy when you know how ;-)

If X is an array of numbers and

n = len(X)
m = sum(X)/n

written another way:

m n = sum(X)

then expanding the original form of the sum of squares:

sum((X - m)²) = sum( X² - 2 m X + m² )
              = sum(X²) - sum(2 m X) + sum(m²)
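A quick numerical check of this expansion may help. The sketch below uses a small made-up list for X (the values are an assumption, chosen only so the arithmetic comes out exact) and compares the original sum of squares with the three separate sums.

```python
# Made-up example data; any list of numbers would do.
X = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(X)
m = sum(X) / n  # the mean, 5.0 for this data

# Left-hand side: sum of squared deviations from the mean.
lhs = sum((x - m) ** 2 for x in X)

# Right-hand side: the three sums from the expansion.
term1 = sum(x ** 2 for x in X)     # sum(X²)
term2 = sum(2 * m * x for x in X)  # sum(2 m X)
term3 = sum(m ** 2 for x in X)     # sum(m²), i.e. m² added n times

print(lhs, term1 - term2 + term3)  # 24.0 24.0
```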

But m is a constant, so it can move in front of the summation. The second term is then

-sum(2 m X) = -2 m sum(X)
    = -2 m² n

And the third term is

sum(m²) = m² n

They add to give

- m² n

but since m n = sum(X), we can go back to sum(X):

- m sum(X)

and the whole thing is

sum((X - m)²) = sum(X²) - m sum(X)

which was "precisely what was required to be proved." (here)
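To see the two forms side by side, here is a minimal sketch in plain Python. The function names ss_direct and ss_shortcut and the example data are my own, purely for illustration, and the final division by n - 1 assumes the usual sample variance.

```python
def ss_direct(X):
    """Sum of squares as sum((X - m)²)."""
    n = len(X)
    m = sum(X) / n
    return sum((x - m) ** 2 for x in X)


def ss_shortcut(X):
    """Sum of squares as sum(X²) - m sum(X), the 'correction term' form."""
    n = len(X)
    m = sum(X) / n
    return sum(x * x for x in X) - m * sum(X)


# Made-up example data with mean 5.0
X = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

print(ss_direct(X), ss_shortcut(X))    # 24.0 24.0
print(ss_shortcut(X) / (len(X) - 1))   # sample variance: 4.8
```

The two agree exactly here because the numbers are small and round. With floating point data whose mean is large compared to its spread, the shortcut form subtracts two large, nearly equal numbers and can lose precision, so the direct form is probably the safer default.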

I think the existence of the family of sites including Stack Exchange and Stack Overflow is just fantastic.

More interesting than my transcription of Srikant's derivation is that his markup looks so much prettier. The source for the first line is

$\sum_i(X_i-m)^2 = \sum_i(X_i^2 + m^2 - 2 X_i m)$

which is obviously not rendering correctly on this page but looks a bit like LaTeX. Anybody know how this is done or how one could do it in Blogger? Time for another question, I guess.
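For comparison, here is my guess at how the whole derivation above would look in the same kind of markup. It is my own transcription, not Srikant's source, and it assumes a renderer that understands the amsmath align* environment.

```latex
\begin{align*}
\sum_i (X_i - m)^2 &= \sum_i \left( X_i^2 - 2 m X_i + m^2 \right) \\
                   &= \sum_i X_i^2 - 2 m \sum_i X_i + n m^2 \\
                   &= \sum_i X_i^2 - 2 n m^2 + n m^2 \\
                   &= \sum_i X_i^2 - n m^2 \\
                   &= \sum_i X_i^2 - m \sum_i X_i
\end{align*}
```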