Python for Bioinformatics: import random

Friday, August 8, 2008

import random

I'm starting to work on Deonier again. I read about half of the book last summer and did some of the problems, but I need to revisit it. But first, a post about random numbers and choosing random elements from a sequence in Python. We use this a lot in simulations for Bioinformatics and in generating statistical confidence intervals.

Python has a module called "random" for generating such numbers. As the docs say:

Almost all module functions depend on the basic function random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1. The underlying implementation in C is both fast and threadsafe. The Mersenne Twister is one of the most extensively tested random number generators in existence.

Sometimes I've used an instance of the class random.Random(), but really this shouldn't be necessary unless you need random number generators that don't share state. If you seed the random number generator with the same integer seed, you'll get the same sequence. You can also seed with time.time(), which returns a float (in fact, this is the default if no seed is provided).

import random
R1 = random.Random(157)
b = 'ACGT'
L = [R1.choice(b) for i in range(20)]
print ''.join(L)
# GAACATAAAAGTCCCGACAT

R2 = random.Random(157)
L2 = [R2.choice(b) for i in range(20)]
print ''.join(L2)
# GAACATAAAAGTCCCGACAT

I use the functions "shuffle", "choice", and "sample" a lot. According to the help on my version of Python

Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin

"sample" is supposed to leave the original population unchanged. I expected that it would be suitable for sampling with replacement. But, asking for more elements than exist in the population raises

ValueError: sample larger than population

and the online docs are explicit that this function does sampling without replacement. Here is a recipe that does sampling with replacement suitable for very large sequences:

There are also various functions to allow sampling from distributions, e.g. "gauss" samples come from the normal distribution. In Chapter 2 Deonier wants to sample from the binomial distribution, which in R you do like this:

x <- rbinom(observations,trials,probability)

Each observation consists of a number of trials with the stated probability of success for each trial, and what is reported is the number of successes in the observation.

t = 5
o = 20
p = 0.5
v <- rbinom(o,t,p)
v
[1] 2 2 0 2 2 2 2 3 3 4 3 3 0 4 3 3 1 2 1 2

The Python random module does not provide a function to do this directly. What we can do is make a list which contains elements corresponding to success and failure with the correct ratio. Then we simple choose randomly from those with replacement, by using random.choice.

import random
# suppose p = 0.25
L = 'SFFF'

# trials per observation
R = range(5)

result = list()
for o in range(20):
    result.append(
        [random.choice(L) for t in R])
        
v = [r.count('S') for r in result]
for n in v: print n,
print

# 1 2 0 1 1 2 2 1 3 0 2 1 0 2 1 2 0 2 1 0