Python for Bioinformatics: Why random mutation does not follow the Poisson?

Friday, February 26, 2010

Why random mutation does not follow the Poisson?

In the simulation of sequence evolution from last time, there is a connection with the Poisson distribution that was also a subject of recent posts (here and here).

Consider an individual position in the sequence, say the first nucleotide. If we ask the question, what is the probability that the initial T has changed to another nucleotide after some number of trials, you can see that this should be described by the Poisson distribution. We have many cycles, and for an individual cycle the probability that the first nucleotide will change is very low (p = 0.01); after 100 cycles, on the average, the mean number of changes is 1. Therefore, we expect the number of unchanged nucleotides to be:

P(k) = λ^k / k! * e^-λ
P(k=0) = λ⁰ / 0! * e^-λ
P(k=0) = e^-λ=  e^-1 = 0.368

So the Hamming distance should be 1 minus 1/e, which is 0.632. But inspection of the plot, or printing out the values for each run:

   5%   0.047  A 252, C 251, G 248, T 249
  10%   0.09   A 256, C 246, G 254, T 244
  50%   0.366  A 246, C 260, G 248, T 246
 100%   0.561  A 254, C 270, G 251, T 225
 200%   0.713  A 241, C 280, G 245, T 234
   5%   0.05   A 250, C 247, G 248, T 255
  10%   0.097  A 245, C 252, G 252, T 251
  50%   0.351  A 235, C 270, G 245, T 250
 100%   0.557  A 251, C 249, G 258, T 242
 200%   0.696  A 228, C 242, G 301, T 229
   5%   0.047  A 252, C 243, G 252, T 253
  10%   0.094  A 248, C 240, G 258, T 254
  50%   0.342  A 262, C 240, G 241, T 257
 100%   0.538  A 257, C 239, G 257, T 247
 200%   0.692  A 257, C 246, G 258, T 239

shows that the actual value is more like 0.56.

What's going on?

As the number of mutations increases, there is an increase in the probability of targeting a position that was already mutated in a previous round. This is what we were saying before, that the non-linearity is due to this effect. Now we need to recognize that back mutation can occur, restoring the original nucleotide. This depresses the mutation frequency below that predicted from the Poisson distribution. To see that this explanation is correct, we can modify the code to mark the changed bases as lower case.

def JC_rates():
    #return { 'A':'CGT', 'C':'AGT','G':'ACT', 'T':'ACG' }
    return { 'A':'cgt', 'C':'agt','G':'act', 'T':'acg',
             'a':'cgt', 'c':'agt','g':'act', 't':'acg' }

With this single difference, we now obtain the values predicted by the Poisson.

 100%   0.636  A 88, C 93, G 87, T 96
 100%   0.634  A 91, C 97, G 86, T 92
 100%   0.629  A 95, C 96, G 95, T 85

How about that?