Sunday, January 10, 2010

Z-scores

As in several recent posts, I've been exploring PyCogent (docs). In fact, I hope to contribute to the cookbook documentation---"PyCogent for dummies," in a sense. I'll post about it here if and when my stuff is accepted.

PyCogent uses Numpy extensively, and I'm not really familiar with that either, so I've been fiddling with it today. A goal for the future would be to explore how one would use Python (and Numpy and Scipy) for processing microarray data. I know that Bioconductor has become standard---in fact I gave a talk about it. But R is ... ugh.

So, the first thing you want to do with the data from a microarray experiment is to put it into an array :), and then, likely, normalize the data. And that leads to the question of how to deal with missing values---nan (not a number). The problem is that if you run numpy.mean or numpy.std on a 1D array containing an nan value, you get back nan.

I couldn't find a function to do this, so I decided to roll my own for fun. It doesn't have to be super efficient because you only do this once or twice, and Python is already plenty fast enough. And please, if anybody knows how to do this better, let me know.

At the bottom of the post is the code. This is the output:

[  2.29908899  -0.63725912  23.02275554  -5.0692318 ]
mean 4.9038 stdev 10.7848
[-0.24152091 -0.51378873 1.6800454 -0.92473577]
[-0.24152091 -0.51378873 1.6800454 -0.92473577]

[-13.41031373 -0.40077508 11.72402522 -3.76176703]
mean -1.4622 stdev 8.9868
[-1.32952082 0.11811049 1.46729289 -0.25588257]
[-1.32952082 0.11811049 1.46729289 -0.25588257]

In the two blocks above, we print two randomly chosen rows, with the mean and stdev. Then we manually calculate the z-score, and compare it to the output from the code. In the second part of the output, we find the first two rows that contain NaN's and check to see that they are handled correctly.

[ -9.29748571  10.472215            NaN  -4.96373691]
[-0.94695253 1.38312492 NaN -0.4361724 ]
[-12.34288901 NaN 11.73414048 -1.49869538]
[-1.18230539 NaN 1.26317611 -0.08087072]


from numpy import nan, isnan
from numpy import array, mean, std, random

def z_score(row):
L = [n for n in row if not isnan(n)]
m = mean(L)
s = std(L)
zL = [1.0 * (n - m) / s for n in L]
if len(L) == len(row): return zL
# deal with nan
retL = list()
for n in row:
if isnan(n):
retL.append(nan)
else:
retL.append(zL.pop(0))
assert len(zL) == 0
return retL

def z_scores_by_row(A):
retL = [z_score(r) for r in A]
Z = array(retL)
Z.shape = A.shape
return Z

# utilities

def show(row):
m = mean(row)
s = std(row)
print row
print 'mean', round(m,4), 'stdev', round(s,4)
zL = (row - m) / s
print zL

def has_nan(row):
for n in row:
if isnan(n): return True
return False

#=======================================

if __name__ == '__main__':
random.seed(137)
N = 10000
A = random.randn(N)
A *= 10
for i in range(20):
A[random.randint(N)] = nan
A.shape = (2500, 4)

Z = z_scores_by_row(A)

for i in range(2):
j = random.randint(1000)
#print j,
show(A[j])
print Z[j]
print

L = [(i,r) for i,r in enumerate(A) if has_nan(r)]
for t in L[:2]:
i = t[0]
print A[i]
print Z[i]

1 comment:

jmerk said...

numpy has a module dealing with masked arrays made for this, numpy.ma. you can mask invalid values, such as np.nan, and then use the numpy.ma functions, or numpy.ma.compressed to return an array with masked values removed.