Python for Bioinformatics: Geometric distribution

Sunday, August 16, 2009

Geometric distribution

According to Wikipedia, the geometric distribution is the probability distribution of the number of Bernoulli trials needed to obtain the first success.

So, in the simple case of a fair coin (p = 1/2):

P(X=1) = 1/2
P(X=2) = 1/4      (one failure, then success: 1/2 times 1/2)
P(X=3) = 1/8      (two failures, then success: 1/4 times 1/2)
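
As a quick sanity check on these values, here is a minimal simulation sketch in Python. The function names (`trials_until_success`, `estimate_pmf`), the sample size, and the seed are my own choices for illustration; the estimates should land near 1/2, 1/4 and 1/8 but will vary slightly with the seed.

```python
import random

def trials_until_success(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:  # a failure happens with probability 1 - p
        k += 1
    return k

def estimate_pmf(p, n_samples, max_k, seed=0):
    """Estimate P(X=k) for k = 1..max_k from n_samples simulated runs."""
    rng = random.Random(seed)
    counts = [0] * (max_k + 1)
    for _ in range(n_samples):
        k = trials_until_success(p, rng)
        if k <= max_k:
            counts[k] += 1
    return [c / n_samples for c in counts[1:]]

# Fair coin: compare the simulated frequencies with 1/2, 1/4, 1/8.
estimates = estimate_pmf(p=0.5, n_samples=100_000, max_k=3)
for k, est in enumerate(estimates, start=1):
    print(f"P(X={k}) ~ {est:.3f}   exact: {0.5 ** k:.3f}")
```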

According to MathWorld, the geometric distribution is the only discrete memoryless distribution, and is a discrete analog of the exponential distribution.

The memoryless property can be seen easily if we take the geometric series:

1/2 + 1/4 + 1/8 ...

If we have already obtained a failure on the first trial, then we remove the first term (corresponding to success on the first trial) and then normalize by dividing by the sum of all remaining terms (1 - 1/2):

  (1/4 + 1/8 + 1/16 ...) / (1/2)
  = 2/4 + 2/8 + 2/16 ...
  = 1/2 + 1/4 + 1/8 ...

which is the same series we started with.
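
The same renormalization can be checked with exact fractions. This is only a sketch for the fair-coin series above; the variable names and the choice of eight terms are mine.

```python
from fractions import Fraction

# First few terms of the series 1/2 + 1/4 + 1/8 ...
n_terms = 8
series = [Fraction(1, 2 ** k) for k in range(1, n_terms + 1)]

# Condition on a failure in the first trial: drop the first term,
# then divide what is left by the remaining probability (1 - 1/2).
remaining = series[1:]
norm = 1 - series[0]
conditioned = [term / norm for term in remaining]

# The renormalized terms match the original series term by term.
print(conditioned == series[:-1])
```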

The normalization is needed because the terms must sum to 1 for a proper probability distribution. In general, for a process with probability of success p and probability of failure q = 1 - p:

P(X=k) = q^(k-1) * p
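
A minimal Python sketch of this formula follows. The function name `geometric_pmf` and the example value p = 0.2 are assumptions for illustration; the second check simply confirms that the terms add up to (very nearly) 1 when enough of them are summed.

```python
def geometric_pmf(k, p):
    """P(X = k) = q^(k-1) * p, where k = 1, 2, 3, ... and q = 1 - p."""
    if k < 1:
        raise ValueError("k counts trials, so it must be at least 1")
    q = 1.0 - p
    return q ** (k - 1) * p

# Fair coin: reproduces 1/2, 1/4, 1/8.
fair = [geometric_pmf(k, 0.5) for k in (1, 2, 3)]
print(fair)

# Any p: the probabilities should sum to 1 (up to the truncated tail).
total = sum(geometric_pmf(k, 0.2) for k in range(1, 201))
print(total)
```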

According to Wikipedia, the mean is 1/p and the variance is q/p². We can compare that to the exponential distribution, which has the pdf:

λe^(-λx)

and mean = 1/λ and variance = 1/λ².
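
The geometric mean and variance quoted above can be checked numerically by summing the series directly. This is a sketch only: the function name `geometric_moments`, the truncation at 10,000 terms, and the test value p = 0.25 are my own choices.

```python
def geometric_moments(p, n_terms=10_000):
    """Approximate the mean and variance by summing k * P(X=k) and k^2 * P(X=k)."""
    q = 1.0 - p
    mean = 0.0
    second_moment = 0.0
    term = p  # P(X=1)
    for k in range(1, n_terms + 1):
        mean += k * term
        second_moment += k * k * term
        term *= q  # P(X=k+1) = q * P(X=k)
    return mean, second_moment - mean ** 2

p = 0.25
mean, variance = geometric_moments(p)
print(mean, 1 / p)                  # summed mean vs. 1/p
print(variance, (1 - p) / p ** 2)   # summed variance vs. q/p^2
```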

It is not clear to me at present why the expressions for the variance don't match up.
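
One possible way to reconcile them, offered as a sketch rather than a settled answer: the exponential distribution arises as a limit in which each trial occupies a short time step dt and the success probability per step is p = λ·dt. The waiting time T = X·dt then has mean dt/p = 1/λ and variance dt²·q/p² = q/λ², and since q approaches 1 as dt shrinks, the variance approaches 1/λ². The names `lam` and `dt` and the value λ = 2 below are assumptions for illustration.

```python
def scaled_geometric_moments(lam, dt):
    """Mean and variance of T = X * dt, where X is geometric with p = lam * dt."""
    p = lam * dt
    if not 0 < p <= 1:
        raise ValueError("lam * dt must be a probability in (0, 1]")
    q = 1.0 - p
    mean = dt / p                    # dt * (1/p)
    variance = dt ** 2 * q / p ** 2  # dt^2 * (q/p^2)
    return mean, variance

lam = 2.0
print("exponential:", 1 / lam, 1 / lam ** 2)
for dt in (0.1, 0.01, 0.001, 0.0001):
    mean, variance = scaled_geometric_moments(lam, dt)
    print(f"dt={dt}: mean={mean:.6f} variance={variance:.6f}")
```

In this scaling the extra factor of q is what separates the two variance formulas, and it fades as the steps get finer.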

R code:

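# x[i] = (1/2)^(i-1): each entry is half the previous one
x <- numeric(7)
x[1] <- 1
for (i in 2:length(x)) {
  x[i] <- x[i-1] / 2
}
# draw the values as a red step plot on a 0-to-1 vertical axis
plot(x, type='s',
  ylim=c(0,1.0),
  col='red', lwd=3)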