Monday, November 8, 2010

Interlude: calculating the sum of squares

This is the seventh in a series of posts about Student's t test. The last post has a list of all the previous posts here.

The PyCogent two-sample t-test uses an interesting method to calculate the sum of squares `(X - mean(X))**2`. Here is my transcription and a comparison to a more usual approach:

 `from __future__ import divisionimport numpy as npnp.random.seed(153)def pycogent_sumsq(X): return sum(X**2) - sum(X)**2 / len(X)def my_sumsq(X): X_bar = np.mean(X) squares = (X - X_bar)**2 return sum(squares)for i in range(5): X = np.random.normal(0,2,5) print pycogent_sumsq(X), my_sumsq(X)`

We do five trials and compare the results.

 `15.0864873521 15.086487352114.2528488226 14.25284882268.86248689793 8.8624868979322.3695874949 22.369587494937.0783435439 37.0783435439`

Clearly, it works. The question is, how?
Using the code listed at the bottom of the post, examine the calculation.

 `pycogent_sumsqX 3.003 0.99 -2.319 5.699 -0.03X**2 9.019 0.98 5.379 32.483 0.001sum(X**2) 47.86sum(X) 7.34sum(X)**2 53.92sum(X)**2/len(X) 10.78sum(X**2) - sum(X)**2/len(X) 37.08pycogent_sumsq(X) 37.08my_sumsqX - X_bar 1.535 -0.478 -3.788 4.231 -1.499(X - X_bar)**2 2.355 0.229 14.348 17.9 2.247sum((X - X_bar)**2) 37.08my_sumsq(X) 37.08`

But I am stuck in the process of figuring out why they are equivalent.

 `(1) sum(X**2) = x12 + x22 + ...(2) sum(X)**2/n = (x1 + x2 + ... )2 / nm = mean(X) = sum(X) / n = (x1 + x2 + ... ) / n(1) - (2) = (x12 + x22 + ... ) - m (x1 + x2 + ... ) = x12 - mx1 + x22 - mx2 + ... = x1(x1 - m) + x2(x2 - m) + ...`

I don't seem to be able to reach:

 ` = (x1 - m)2 + (x2 - m)2 + ...`

Any ideas?

[UPDATE: I got an answer from Srikant Vadali here, and will discuss it sometime soon.]

Remainder of the code listing:

 `def show(s,f): print s.ljust(30), try: print str(round(f,2)).rjust(6) except TypeError: for n in f: print str(round(n,3)).rjust(6), printprintprint 'pycogent_sumsq'show('X', X)show('X**2', X**2)show('sum(X**2)', sum(X**2))show('sum(X)', sum(X))show('sum(X)**2', sum(X)**2)show('sum(X)**2/len(X)',sum(X)**2/len(X))show('sum(X**2) - sum(X)**2/len(X)', sum(X**2) - sum(X)**2 / len(X))show('pycogent_sumsq(X)',pycogent_sumsq(X))printprint 'my_sumsq'X_bar = np.mean(X)show('X - X_bar', X - X_bar)show('(X - X_bar)**2', (X - X_bar)**2)show('sum((X - X_bar)**2)', sum((X - X_bar)**2))show('my_sumsq(X)', my_sumsq(X))`