Python for Bioinformatics: Comparing R and Python sequences

Friday, February 5, 2010

This is a post about elementary sequence operations in R and Python. It's as much for me as for you.

The most obvious difference between sequences in R and Python is Python's use of 0-based indexing:

R:

> 1:5
[1] 1 2 3 4 5
> A = seq(1,10,by=2)
> A
[1] 1 3 5 7 9
> A[2]
[1] 3

Python:

>>> range(1,6)
[1, 2, 3, 4, 5]
>>> A = range(1,11,2)
>>> A
[1, 3, 5, 7, 9]
>>> A[1]
3
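
The Python session above is from Python 2, where range returns a list. In Python 3 it returns a lazy object, so it needs list(...) to print the same values. Here is a minimal sketch of the index correspondence; the helper r_index is a name I made up for illustration, not something from either language:

```python
# Python 3: range() is lazy, so wrap it in list() to see the values.
A = list(range(1, 11, 2))   # same values as R's seq(1, 10, by=2)

def r_index(seq, i):
    """Return the element R would call seq[i] (1-based, i >= 1)."""
    if i < 1:
        raise IndexError("R-style positions start at 1")
    return seq[i - 1]

print(A)              # [1, 3, 5, 7, 9]
print(A[1])           # 3, the second element in Python's numbering
print(r_index(A, 2))  # 3, the same element in R's numbering
```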

Another difference is that R, unlike Python, lets you assign to an index beyond the current length of the vector; the gap is filled with NA:

> m = 1:2
> m[6] = 35
> m
[1]  1  2 NA NA NA 35

>>> m = range(1,3)
>>> m[6] = 35
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
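
If you want R's behavior in Python, one option is to pad the list yourself before assigning. This is only a sketch: assign_padded is a hypothetical helper of my own, and using None as the stand-in for NA is an assumption, not a standard convention.

```python
def assign_padded(values, index, item, fill=None):
    """Set values[index] = item, growing the list with `fill` if needed.

    `index` is 0-based, as usual in Python.
    """
    if index < 0:
        raise IndexError("negative positions are not handled in this sketch")
    if index >= len(values):
        # Extend the list so that `index` becomes a valid position.
        values.extend([fill] * (index + 1 - len(values)))
    values[index] = item
    return values

m = list(range(1, 3))      # [1, 2]
assign_padded(m, 5, 35)    # R's m[6] is Python's m[5]
print(m)                   # [1, 2, None, None, None, 35]
```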

In R, but not in plain Python, the increment can be a non-integral value (Python's built-in range accepts only integers):

> A = seq(0,20,by=0.1)
> A[1]
[1] 0
> length(A)
[1] 201
> A[length(A)]
[1] 20

We can use numpy to get around this restriction:

>>> import numpy as np
>>> A = np.arange(0,20.1,0.1)
>>> A[0]
0.0
>>> len(A)
201
>>> A[-1]
20.0
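
One caveat: 0.1 has no exact binary representation, so with a float step the number of elements np.arange returns can depend on rounding near the endpoint. A way around this, sketched here with a step count I worked out by hand, is to generate integer counts and scale them:

```python
import numpy as np

# Build 0, 0.1, ..., 20 from integer counts, so the number of
# elements is fixed by an integer range rather than by float rounding.
n_steps = 200                       # (20 - 0) / 0.1, worked out by hand
A = np.arange(n_steps + 1) / 10.0   # 201 values

print(len(A))   # 201
print(A[0])     # 0.0
print(A[-1])    # 20.0
```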

It's sometimes more convenient to specify how many numbers we want to obtain (evenly spaced in some interval):

> A = seq(0,2,length=6)
> A
[1] 0.0 0.4 0.8 1.2 1.6 2.0

>>> A = np.linspace(0,2,6)
>>> A
array([ 0. ,  0.4,  0.8,  1.2,  1.6,  2. ])
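
To see which step size numpy picked, linspace can return it along with the array. A small sketch (the variable names are mine):

```python
import numpy as np

start, stop, n = 0.0, 2.0, 6
A, step = np.linspace(start, stop, n, retstep=True)

# With both endpoints included, n points span n - 1 intervals.
expected_step = (stop - start) / (n - 1)

print(len(A))   # 6
print(A[-1])    # 2.0
print(abs(step - expected_step) < 1e-12)   # True
```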

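Vectorized operations over rows and columns. Note that R fills a matrix column by column (hence the transpose), while numpy fills it row by row: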

> m = 1:9
> dim(m) = c(3,3)
> m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> m = t(m)
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> apply(m,1,mean)
[1] 2 5 8
> apply(m,2,mean)
[1] 4 5 6
> mean(m)
[1] 5

>>> m = np.arange(1,10)
>>> m.shape = (3,3)
>>> m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> np.mean(m, axis=0)
array([ 4.,  5.,  6.])
>>> np.mean(m, axis=1)
array([ 2.,  5.,  8.])
>>> np.mean(m)
5.0
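
The thing to keep straight is the numbering of the dimensions: apply(m,1,mean) in R gives the same result as axis=1 in numpy, and apply(m,2,mean) the same as axis=0. For a function that has no axis argument, np.apply_along_axis looks like the closest analogue of apply. A sketch, with a per-row range as my own example function:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)

row_means = m.mean(axis=1)   # like R's apply(m, 1, mean)
col_means = m.mean(axis=0)   # like R's apply(m, 2, mean)

# Apply an arbitrary 1-D function to each row (axis 1):
# here, the spread (max - min) of the row.
row_ranges = np.apply_along_axis(lambda r: r.max() - r.min(), 1, m)

print(row_means.tolist())    # [2.0, 5.0, 8.0]
print(col_means.tolist())    # [4.0, 5.0, 6.0]
print(row_ranges.tolist())   # [2, 2, 2]
```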

Here is an example of fancy indexing that rearranges rows and columns at the same time:

> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> m[c(2,3,1),c(3,2,1)]
     [,1] [,2] [,3]
[1,]    6    5    4
[2,]    9    8    7
[3,]    3    2    1

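The naive translation to Python gives something different from R (though useful): numpy pairs the two index lists element by element and returns one value per pair. Indexing the rows first and then the columns reproduces the R result: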

>>> m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> m[[1,2,0],[2,1,0]]
array([6, 8, 1])
>>> m[[1,2,0],:][:,[2,1,0]]
array([[6, 5, 4],
       [9, 8, 7],
       [3, 2, 1]])
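
numpy also has np.ix_, which builds the cross product of the row and column lists and so does the rearrangement in one indexing step. A short sketch contrasting the two behaviors:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)
rows = [1, 2, 0]
cols = [2, 1, 0]

# Paired indexing: picks elements (1,2), (2,1) and (0,0).
paired = m[rows, cols]

# Cross-product indexing: every row in `rows` with every column in `cols`.
crossed = m[np.ix_(rows, cols)]

print(paired.tolist())    # [6, 8, 1]
print(crossed.tolist())   # [[6, 5, 4], [9, 8, 7], [3, 2, 1]]
```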

R has a few more indexing tricks; I don't know whether Python has equivalents:

> m = 1:9
> dim(m) = c(3,3)
> m = t(m)
> m[-1,]
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9

> sel = m[1,] > 2
> sel
[1] FALSE FALSE  TRUE
> m[sel]
[1] 7 8 9

> y = -5:5
> y
 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
> y[y < 0] <- -y[y < 0]
> y
 [1] 5 4 3 2 1 0 1 2 3 4 5
> y <- abs(y)
> y
 [1] 5 4 3 2 1 0 1 2 3 4 5
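
For what it's worth, numpy does seem to offer close counterparts to these, though the semantics are not identical. In particular, R's m[sel] recycles the logical vector over the flattened (column-major) matrix, whereas numpy's m[sel] selects whole rows; for this matrix the values happen to coincide. A sketch under those caveats:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)

# R: m[-1,]  (drop the first row). Note that a negative index in
# Python counts from the end, so it cannot be used to mean "drop".
dropped = np.delete(m, 0, axis=0)   # m[1:] gives the same result here

# R: sel = m[1,] > 2
sel = m[0] > 2                      # [False, False, True]
rows = m[sel]                       # rows where sel is True: [[7, 8, 9]]
cols = m[:, sel]                    # columns where sel is True

# R: y[y < 0] <- -y[y < 0]
y = np.arange(-5, 6)
y[y < 0] = -y[y < 0]

print(dropped.tolist())   # [[4, 5, 6], [7, 8, 9]]
print(rows.tolist())      # [[7, 8, 9]]
print(cols.tolist())      # [[3], [6], [9]]
print(y.tolist())         # [5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5]
```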

But here, finally, is an example we can do in both:

> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> sel = array(c(1:3,3:1), dim=c(3,2))
> sel
     [,1] [,2]
[1,]    1    3
[2,]    2    2
[3,]    3    1
> m[sel] = 0
> m
     [,1] [,2] [,3]
[1,]    1    2    0
[2,]    4    0    6
[3,]    0    8    9

>>> m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> m[[0,1,2],[2,1,0]] = 0
>>> m
array([[1, 2, 0],
       [4, 0, 6],
       [0, 8, 9]])
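
If the positions arrive as an R-style matrix of (row, column) pairs, they can be shifted to 0-based and split into the two index arrays numpy expects. A minimal sketch; sel_r is just my name for a copy of the R index matrix:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)

# The R index matrix from above: one (row, column) pair per line, 1-based.
sel_r = np.array([[1, 3],
                  [2, 2],
                  [3, 1]])

sel = sel_r - 1                 # shift to 0-based positions
m[sel[:, 0], sel[:, 1]] = 0     # zero each (row, column) pair

print(m.tolist())   # [[1, 2, 0], [4, 0, 6], [0, 8, 9]]
```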