Python for Bioinformatics: Vectorize in numpy

Friday, March 12, 2010

Vectorize in numpy

I'm working on a little project using numpy and matplotlib where I found a need to write my own function that operates on numpy arrays. So, for example, we could have, as a first attempt:

import numpy as np
A = np.arange(0,1,0.1)
def f(n):
    if n > 0.3:  return n
    return 0
print A
print f(A)

$ python vec.py 
[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9]
Traceback (most recent call last):
  File "vec.py", line 7, in 
    print f(A)
  File "vec.py", line 4, in f
    if n > 0.3:  return n
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Numpy is trying to tell us that we can't just feed an array to a function that operates on single values. The function has to be "vectorized." Say:

f = np.vectorize(f)
print f(A)

[0 0 0 0 0 0 0 0 0 0]

We didn't get a ValueError, but the result is not correct. What happened? According to the help(np.vectorize):

Data-type of output of vectorized is determined by calling the function with the first element of the input.

The problem is that although the array contains floats, we specified 0 as the value to return for n < 0.3. The first n evaluated was 0.0 which is less than 0.3 and so returns the integer 0. That leads to trouble, since the vectorized function now returns all the rest of the values as ints. Or so the docs imply. (Also see here). There are a couple more issues. First:

A[0] = 2.0
print f(A)
B = np.arange(1,0,-0.1)
print f(B)

[2 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0]

The output above is obtained if it is run in the same script as the first part. That is, the data type to be returned is only evaluated once, when the function is first called. If the first call is commented out, then we get this:


[ 2.   0.   0.   0.3  0.4  0.5  0.6  0.7  0.8  0.9]
[ 1.   0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.   0. ]

One way to fix the whole problem is to return a float. The following example also uses vectorize as a decorator

@np.vectorize
def f2(n):
    if n > 0.3:  return n
    return 0.0
print f2(A)

[ 0.   0.   0.   0.3  0.4  0.5  0.6  0.7  0.8  0.9]

Continuing in the docs:

This can be avoided by specifying the otypes argument as either a string of typecode characters or a list of data-types specifiers.

We're supposed to do something like:

def f3(n,otypes=[np.float]):
def f3(n,otypes='d'):

But, in my tests, neither of these can compensate for returning the integer 0. The real reason for the otypes argument seems to be that there is a large number of numpy dtypes.