Python for Bioinformatics: Jukes-Cantor (1)

Friday, February 26, 2010

Jukes-Cantor (1)

It is widely reported that Jukes and Cantor proposed the first model for sequence evolution (often designated JC69)¹. If the probability that a given nucleotide, say A, will change to another nucleotide, T, in a short time dt, is α * dt, then the same rate constant applies to all 12 possible nucleotide interconversions. I say widely reported because the original reference is in a book that is hard to come by, and I would guess that at least some of the people who cite it have not actually seen a copy. If you have an e-copy (doubtful anyone does), I would love for you to send it to me.

We can set this up as a matrix like so:

     A     C     G     T
A  -3α     α     α     α
C    α   -3α     α     α
G    α     α   -3α     α
T    α     α     α   -3α

These are instantaneous rates of change (and only work for short times). We would like to get an equation where the probability of change is expressed as a function of time, so that it could be calculated at any time. In fact, that is what I want to do on the blog, to go through the derivation in Higgs & Attwood. But, for the first post, let's start by peeking at the answer:

Consider two generic nucleotides X and Y. Then the formulae are:

P_XX(t) = 1/4 + 3/4*e^-4αt
P_XY(t) = 1/4 - 1/4*e^-4αt

We can see that these make intuitive sense, by considering the results they generate for time-zero and also for very long times. At time-zero the exponential term equals one, so we have the probability that the nucleotide X has not changed after zero time is, of course equal to one. At very long times, the second term of both equations becomes zero, and they both become equal to 1/4. At long times, no matter what we started with, we have equal probability of each of four nucleotides.

And finally, we see that at any intermediate time, no matter what the value of e^-4αt, since the total of the three P_XY terms with e cancels the single P_XY term, the total probability is always equal to one, as required.

We can also verify that this equation gives (something very close to) the required instantaneous rate upon differentiation. Remembering that d/dt(e^kt) = k*e^kt:

d/dt(P_XX(t)) =  -3*α*e^-4αt
d/dt(P_XY(t)) =   α*e^-4αt

I'm missing the final step. I think what I have to do is compute P(t + dt) - P(t) and that'll be equal to these slopes. I'll report back or if anyone can help me out here, please do.

1. Jukes, TH and Cantor, CR. 1969. Evolution of protein molecules. Pp. 21-123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

Code for the plot:

from math import e
import numpy as np
import matplotlib.pyplot as plt

def pXX(at):
    return e**(-4*at) * 0.75 + 0.25

def pXY(at):
    return -e**(-4*at) * 0.25 + 0.25

at = A = np.arange(0,2.001,0.001)
plt.plot(at, pXX(at),lw=3,color='r')
plt.plot(at, pXY(at),lw=3,color='b')

ax = plt.axes()
ax.set_xlim(0, 2.0)
ax.set_ylim(0, 1.0)
ax.grid(True)
plt.savefig('example.png')