Monday, November 8, 2010

Student's t-test again 7

Interlude: calculating the sum of squares

This is the seventh in a series of posts about Student's t test. The last post has a list of all the previous posts here.

The PyCogent two-sample t-test uses an interesting method to calculate the sum of squares (X - mean(X))**2. Here is my transcription and a comparison to a more usual approach:


from __future__ import division
import numpy as np
np.random.seed(153)

def pycogent_sumsq(X):
return sum(X**2) - sum(X)**2 / len(X)

def my_sumsq(X):
X_bar = np.mean(X)
squares = (X - X_bar)**2
return sum(squares)

for i in range(5):
X = np.random.normal(0,2,5)
print pycogent_sumsq(X), my_sumsq(X)

We do five trials and compare the results.

15.0864873521 15.0864873521
14.2528488226 14.2528488226
8.86248689793 8.86248689793
22.3695874949 22.3695874949
37.0783435439 37.0783435439

Clearly, it works. The question is, how?
Using the code listed at the bottom of the post, examine the calculation.

pycogent_sumsq
X 3.003 0.99 -2.319 5.699 -0.03
X**2 9.019 0.98 5.379 32.483 0.001
sum(X**2) 47.86
sum(X) 7.34
sum(X)**2 53.92
sum(X)**2/len(X) 10.78
sum(X**2) - sum(X)**2/len(X) 37.08
pycogent_sumsq(X) 37.08

my_sumsq
X - X_bar 1.535 -0.478 -3.788 4.231 -1.499
(X - X_bar)**2 2.355 0.229 14.348 17.9 2.247
sum((X - X_bar)**2) 37.08
my_sumsq(X) 37.08

But I am stuck in the process of figuring out why they are equivalent.

(1) sum(X**2)  =   x12 + x22 + ...
(2) sum(X)**2/n = (x1 + x2 + ... )2 / n

m = mean(X) = sum(X) / n
= (x1 + x2 + ... ) / n

(1) - (2) = (x12 + x22 + ... ) - m (x1 + x2 + ... )
= x12 - mx1 + x22 - mx2 + ...
= x1(x1 - m) + x2(x2 - m) + ...

I don't seem to be able to reach:

                     = (x1 - m)2 + (x2 - m)2 + ...

Any ideas?

[UPDATE: I got an answer from Srikant Vadali here, and will discuss it sometime soon.]

Remainder of the code listing:


def show(s,f):
print s.ljust(30),
try:
print str(round(f,2)).rjust(6)
except TypeError:
for n in f:
print str(round(n,3)).rjust(6),
print

print
print 'pycogent_sumsq'
show('X', X)
show('X**2', X**2)
show('sum(X**2)', sum(X**2))
show('sum(X)', sum(X))
show('sum(X)**2', sum(X)**2)
show('sum(X)**2/len(X)',sum(X)**2/len(X))
show('sum(X**2) - sum(X)**2/len(X)', sum(X**2) - sum(X)**2 / len(X))
show('pycogent_sumsq(X)',pycogent_sumsq(X))
print
print 'my_sumsq'
X_bar = np.mean(X)
show('X - X_bar', X - X_bar)
show('(X - X_bar)**2', (X - X_bar)**2)
show('sum((X - X_bar)**2)', sum((X - X_bar)**2))
show('my_sumsq(X)', my_sumsq(X))

No comments: