Python for Bioinformatics: Jukes-Cantor (2)

Friday, February 26, 2010

Jukes-Cantor (2)

The figure illustrates at least part of the reason that we need models of sequence evolution. It comes from a very nice book by Page & Holmes.

What I want to do here is follow the derivation of the equations for P_XX and P_XY as a function of time, as developed in Higgs & Attwood. This isn't strictly necessary from a mathematical viewpoint, since we have already guessed the equations, but it's a fun argument.

Consider the following path: we start with an A at some position at time zero, and after time t + Δt we observe that it is still A. At time t, a short interval Δt before the second observation, it might have been any nucleotide (since we weren't looking then):

  t            Δt
A => [A,C,G,T] => A

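There are four possible paths from A to A, one through each of the possible intermediates. For each path we multiply the probability of reaching the intermediate at time t by the probability of going from it to A during Δt (α*Δt for a change, 1 - 3*α*Δt for no change), and then we sum over the paths. We have: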

P_AA(t + Δt) = α*Δt*(P_AC(t))
            + α*Δt*(P_AG(t))
            + α*Δt*(P_AT(t))
            + (1 - 3*α*Δt)*(P_AA(t))

We can expand the last term to:

P_AA(t) - 3*α*Δt*P_AA(t)
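The four-path sum can also be applied over and over as a numerical recipe. Here is a minimal sketch, assuming an arbitrary rate alpha = 0.1 and a step count chosen so that 3*α*Δt stays well below 1; the names step and evolve are made up for illustration, not taken from Higgs & Attwood.

```python
def step(p, alpha, dt):
    """Advance the probabilities P_AA, P_AC, P_AG, P_AT by one interval dt."""
    total = sum(p.values())
    new_p = {}
    for x in p:
        # Arrive at x from each of the three other nucleotides (alpha*dt each),
        # or stay at x with probability 1 - 3*alpha*dt.
        others = total - p[x]
        new_p[x] = alpha * dt * others + (1 - 3 * alpha * dt) * p[x]
    return new_p


def evolve(alpha, t, n_steps):
    """Start from A at time zero and take n_steps small steps up to time t."""
    p = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}  # keys are the end state X in P_AX
    dt = t / n_steps
    for _ in range(n_steps):
        p = step(p, alpha, dt)
    return p


# alpha and t are illustrative values only
p = evolve(alpha=0.1, t=5.0, n_steps=10000)
print(p["A"], p["C"], sum(p.values()))
```

The four probabilities should keep summing to 1 at every step, and for long enough times each one should drift toward 1/4.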

A wee bit o'calculus. We want to know P_AA(t + Δt). Since Δt is small, we can take the value of the function at t and correct it by adding the slope of the function (at t) times Δt. That is:

P_AA(t + Δt) = P_AA(t) + d/dt P_AA(t) * Δt

So we substitute this expression for the left-hand side of the first equation and then notice that we can subtract P_AA(t) from both sides, leaving:

d/dt P_AA(t) * Δt = α*Δt*(P_AC(t) + P_AG(t) + P_AT(t)) - 3*α*Δt*P_AA(t)

Since Δt occurs in each term on both sides it cancels (which is really the whole point of this). Also the sum of the three P_AX(t) terms is equal to 1 - P_AA(t), and so we have:

d/dt P_AA(t) = α*(1 - P_AA(t)) - 3*α*P_AA(t)
    = α - 4*α*P_AA(t)

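The rate of change of P_AA(t) is a linear function of P_AA(t). It can be written as -4*α*(P_AA(t) - 1/4), so it is proportional to how far P_AA(t) is from 1/4, which is pretty obvious when you think about it. The solution therefore has the form of an exponential plus a constant: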

P_AA(t) = A*e^(-4*α*t) + B

We need the -4*α in the exponent so that it comes out front as a factor when we take the derivative (see here).

We evaluate the constants A and B by considering the boundary conditions. At long times P_AA(t) approaches 1/4 (all four nucleotides become equally likely), so B = 1/4. At time zero P_AA(0) = 1, so A + B = 1 and A = 3/4.

P_AA(t) = 1/4 + 3/4*e^(-4*α*t)
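As a quick sanity check, we can ask Python whether this solution really satisfies d/dt P_AA(t) = α - 4*α*P_AA(t). This is only a sketch: the slope is estimated with a central difference, alpha = 0.1 is an arbitrary choice, and the function names p_aa and slope are mine.

```python
import math


def p_aa(alpha, t):
    """Closed-form P_AA(t) from the text: 1/4 + 3/4*e^(-4*alpha*t)."""
    return 0.25 + 0.75 * math.exp(-4 * alpha * t)


def slope(alpha, t, h=1e-6):
    """Central-difference estimate of d/dt P_AA(t)."""
    return (p_aa(alpha, t + h) - p_aa(alpha, t - h)) / (2 * h)


alpha = 0.1  # arbitrary illustrative rate
for t in (0.0, 1.0, 5.0, 20.0):
    lhs = slope(alpha, t)                      # numerical derivative
    rhs = alpha - 4 * alpha * p_aa(alpha, t)   # right-hand side of the equation
    print(t, lhs, rhs)
```

The two printed columns should agree to within the error of the finite-difference estimate.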

Since the other three P_AX(t) are all equal to each other and together sum to 1 - P_AA(t), we have:

3 * P_AX(t) = 1 - P_AA(t)
            = 3/4 - 3/4*e^(-4*α*t)
P_AX(t)     = 1/4 - 1/4*e^(-4*α*t)

which is just what we said the other day!
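One more check, in the spirit of the path argument we started with. If the two formulas are right, then summing over the intermediate nucleotide should work for any two intervals t1 and t2, not just for a tiny Δt. Below is a rough sketch in plain Python; the values of alpha, t1 and t2 are arbitrary, and jc_matrix and path_sum are names invented here.

```python
import math

NUCS = "ACGT"


def jc_matrix(alpha, t):
    """4x4 table of P_XY(t) built from the two formulas in the text."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay   # P_XX(t)
    diff = 0.25 - 0.25 * decay   # P_XY(t), X != Y
    return [[same if x == y else diff for y in NUCS] for x in NUCS]


def path_sum(p1, p2):
    """Sum over the intermediate nucleotide K: sum of P_XK(t1) * P_KY(t2)."""
    return [[sum(p1[i][k] * p2[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


alpha, t1, t2 = 0.1, 2.0, 3.0  # illustrative values only
via_intermediate = path_sum(jc_matrix(alpha, t1), jc_matrix(alpha, t2))
direct = jc_matrix(alpha, t1 + t2)

print(via_intermediate[0][0], direct[0][0])  # P_AA both ways
print(via_intermediate[0][1], direct[0][1])  # P_AC both ways
print(sum(direct[0]))                        # a row should sum to 1
```

Going through an intermediate at t1 and going directly to t1 + t2 should give the same numbers, up to floating-point rounding.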