In the simulation of sequence evolution from last time, there is a connection with the Poisson distribution that was also a subject of recent posts (here and here).
Consider an individual position in the sequence, say the first nucleotide. If we ask the question, what is the probability that the initial T has changed to another nucleotide after some number of trials, you can see that this should be described by the Poisson distribution. We have many cycles, and for an individual cycle the probability that the first nucleotide will change is very low (p = 0.01); after 100 cycles, on the average, the mean number of changes is 1. Therefore, we expect the number of unchanged nucleotides to be:
So the Hamming distance should be 1 minus 1/e, which is 0.632. But inspection of the plot, or printing out the values for each run:
shows that the actual value is more like 0.56.
What's going on?
As the number of mutations increases, there is an increase in the probability of targeting a position that was already mutated in a previous round. This is what we were saying before, that the non-linearity is due to this effect. Now we need to recognize that back mutation can occur, restoring the original nucleotide. This depresses the mutation frequency below that predicted from the Poisson distribution. To see that this explanation is correct, we can modify the code to mark the changed bases as lower case.
With this single difference, we now obtain the values predicted by the Poisson.
How about that?