## Friday, February 26, 2010

### Jukes-Cantor (2)

The figure illustrates at least part of the reason that we need models of sequence evolution. It comes from a very nice book by Page & Holmes.

What I want to do here is to follow the derivation of the equations for PXX and PXY as a function of time, as developed in Higgs & Atwood. This isn't really necessary from a mathematical viewpoint, since we have already guessed the equations, but it's a fun argument.

Consider the following path: we start with an A at some position at time-zero, and after time t + Δt we observe that it is still A, but realize that at a short time prior to the second observation it might have been any nucleotide (since we weren't looking then):

 ` t ΔtA => [A,C,G,T] => A`

There are four possible paths to get from A to A, through each of the possible intermediates. We sum over the probabilities... We have:

 `PAA(t + Δt) = α*Δt*(PAC(t)) + α*Δt*(PAG(t)) + α*Δt*(PAT(t)) + (1 - 3*α*Δt)*(PAA(t))`

We can expand the last term to:

 `PAA(t) - 3*α*Δt*PAA(t)`

A wee bit o'calculus. We want to know `PAA(t + Δt)`. Since `Δt` is small, we can take the value of the function at t and correct it by adding the slope of the function (at t) times `Δt`. That is:

 `PAA(t + Δt) = PAA(t) + d/dt PAA(t) * Δt`

So we substitute this expression for the left-hand side of the first equation and then notice that we can subtract PAA(t) from both sides, leaving:

 `d/dt PAA(t) * Δt = α*Δt*(PAC(t) + PAG(t) + PAT(t)) - 3*α*ΔtPAA(t)`

Since `Δt` occurs in each term on both sides it cancels (which is really the whole point of this). Also the sum of the three PAX(t) terms is equal to `1 - PAA(t)`, and so we have:

 `d/dt PAA(t) = α*(1 - PAA(t)) - 3*α*PAA(t) = α - 4*αPAA(t)`

The rate of change of `PAA(t)` is proportional to `PAA(t)`, which is pretty obvious when you think about it, and so the form of the equation is an exponential:

 `PAA(t) = A*e-4*α*t + B`

We need the -4α in the exponent, so that it will come out front when we take the derivative (see here).

We evaluate the constants A and B by considering the boundary conditions, namely, at long times `PAA(t) = 1/4`, so `B = 1/4`; and `PAA(0) = 1`, so `A + B = 1` and `A = 3/4`.

 `PAA(t) = 1/4 + 3/4*e-4*α*t`

Since the other three `PAX(t)` are all equal and also equal to `1 - PAA(t)`, we have:

 `3 * PAX(t) = 1 - PAA(t) = 3/4 - 3/4*e-4*α*tPAX(t) = 1/4 - 1/4*e-4*α*t`

which is just what we said the other day!