Saturday, October 30, 2010

Additive variance for addition and subtraction

I've learned that

"for the sum or difference of independent random variables, variances add"

(source)

Here is a Python simulation with numbers drawn from a normal distribution with mean = 0 and std = 2. As expected, the actual variance observed for a sample is not exactly equal to 4, though it approaches this as N increases. What was a surprise to me is that the statement "variances add" is only approximately true for actual data. (That is, unless I've screwed up somehow). Looking back, I saw this before but ignored it.

Here are trials for various N, showing the mean, variance and standard deviations of the samples, the sums of the actual variances, and the variances of the sum and difference.

[UPDATE: I handled the printout of N's value poorly. For example, in the interpreter:
>>> 10e2
1000.0

for N = 100, when what I actually did in code was 10**2. ]


    mean       var       std

N = 10e2
-0.009 3.957 1.989 data1
0.339 3.192 1.787 data2
0.330 7.149 3.776 total
add
0.330 7.101 2.665 combo
sub
-0.348 7.197 2.683 combo
------------------------------
N = 10e3
0.066 3.727 1.931 data1
0.009 4.027 2.007 data2
0.076 7.754 3.937 total
add
0.076 7.753 2.784 combo
sub
0.057 7.755 2.785 combo
------------------------------
N = 10e4
-0.025 4.007 2.002 data1
0.041 4.089 2.022 data2
0.015 8.097 4.024 total
add
0.015 8.087 2.844 combo
sub
-0.066 8.106 2.847 combo
------------------------------
N = 10e7
0.000 4.001 2.000 data1
0.001 3.999 2.000 data2
0.001 8.000 4.000 total
add
0.001 8.001 2.829 combo
sub
-0.000 7.998 2.828 combo
------------------------------



import numpy as np

def show(stats,name=None):
L = [' %3.3f ' % e for e in stats]
print ''.join([e.rjust(10) for e in L]),
if name: print name
else: print

def get_stats(data):
funcs = [np.mean,np.var,np.std]
return np.array([f(data) for f in funcs])

def init(N):
total = np.zeros(3)
f = np.random.normal
rL = list()
for i in range(2):
rL.append(f(loc=0,scale=2,size=N))
stats = get_stats(rL[-1])
show(stats,'data'+str(i+1))
total += stats
show(total,'total')
return rL

def add(A1,A2): return A1 + A2
def sub(A1,A2): return A1 - A2
#---------------------------------------
for e in [2,3,4,7]:
N = 10**e
print 'N = 10e' + str(e)
data1, data2 = init(N)
for f in [add,sub]:
print f.func_name
combo = f(data1,data2)
show(get_stats(combo),'combo')
print '-'*30