## Sunday, January 10, 2010

### Z-scores

As in several recent posts, I've been exploring PyCogent (docs). In fact, I hope to contribute to the cookbook documentation---"PyCogent for dummies," in a sense. I'll post about it here if and when my stuff is accepted.

PyCogent uses Numpy extensively, and I'm not really familiar with that either, so I've been fiddling with it today. A goal for the future would be to explore how one would use Python (and Numpy and Scipy) for processing microarray data. I know that Bioconductor has become standard---in fact I gave a talk about it. But R is ... ugh.

So, the first thing you want to do with the data from a microarray experiment is to put it into an array :), and then, likely, normalize the data. And that leads to the question of how to deal with missing values---nan (not a number). The problem is that if you run numpy.mean or numpy.std on a 1D array containing an nan value, you get back nan.

I couldn't find a function to do this, so I decided to roll my own for fun. It doesn't have to be super efficient because you only do this once or twice, and Python is already plenty fast enough. And please, if anybody knows how to do this better, let me know.

At the bottom of the post is the code. This is the output:

 `[ 2.29908899 -0.63725912 23.02275554 -5.0692318 ]mean 4.9038 stdev 10.7848[-0.24152091 -0.51378873 1.6800454 -0.92473577][-0.24152091 -0.51378873 1.6800454 -0.92473577][-13.41031373 -0.40077508 11.72402522 -3.76176703]mean -1.4622 stdev 8.9868[-1.32952082 0.11811049 1.46729289 -0.25588257][-1.32952082 0.11811049 1.46729289 -0.25588257]`

In the two blocks above, we print two randomly chosen rows, with the mean and stdev. Then we manually calculate the z-score, and compare it to the output from the code. In the second part of the output, we find the first two rows that contain NaN's and check to see that they are handled correctly.

 `[ -9.29748571 10.472215 NaN -4.96373691][-0.94695253 1.38312492 NaN -0.4361724 ][-12.34288901 NaN 11.73414048 -1.49869538][-1.18230539 NaN 1.26317611 -0.08087072]`

 `from numpy import nan, isnanfrom numpy import array, mean, std, randomdef z_score(row): L = [n for n in row if not isnan(n)] m = mean(L) s = std(L) zL = [1.0 * (n - m) / s for n in L] if len(L) == len(row): return zL # deal with nan retL = list() for n in row: if isnan(n): retL.append(nan) else: retL.append(zL.pop(0)) assert len(zL) == 0 return retLdef z_scores_by_row(A): retL = [z_score(r) for r in A] Z = array(retL) Z.shape = A.shape return Z # utilitiesdef show(row): m = mean(row) s = std(row) print row print 'mean', round(m,4), 'stdev', round(s,4) zL = (row - m) / s print zL def has_nan(row): for n in row: if isnan(n): return True return False#=======================================if __name__ == '__main__': random.seed(137) N = 10000 A = random.randn(N) A *= 10 for i in range(20): A[random.randint(N)] = nan A.shape = (2500, 4) Z = z_scores_by_row(A) for i in range(2): j = random.randint(1000) #print j, show(A[j]) print Z[j] print L = [(i,r) for i,r in enumerate(A) if has_nan(r)] for t in L[:2]: i = t[0] print A[i] print Z[i]`

#### 1 comment:

jmerk said...

numpy has a module dealing with masked arrays made for this, numpy.ma. you can mask invalid values, such as np.nan, and then use the numpy.ma functions, or numpy.ma.compressed to return an array with masked values removed.