## Friday, February 26, 2010

### Jukes-Cantor (3)

From last time, we have two equations for sequences changing according to Jukes-Cantor:

 `PXX(t) = 1/4 + 3/4*e-4*α*tPXY(t) = 1/4 - 1/4*e-4*α*t`

We can look at this as the probability that a single site will change in this way (from X to Y) over time, but we can also look at it as the fraction of a collection of sites that will change. Since Y can be any one of three nucleotides, the total fraction of sites that differ between the ancestral sequence and a present-day descendant sequence is three times PXY(t) or:

 `PXN(t) = 3/4 - 3/4*e-4*α*t`

Two present-day homologs (common ancestor) have effectively evolved for twice the time because there are two stretches of evolution of time t. The proportion of sites that differ is:

 `p = 3/4 - 3/4*e-8*α*t`

The above equation is what we observe when we look at the sequences. However, our estimate of the true distance, or actual number of substitutions per site:

 `d = 2*t * 3*α = 6*α*t`

We usually do not know either α or t individually, but we can say that:

 `α*t = d/6p = 3/4 - 3/4*e-8*α*t4/3*p = 1 - e-8/6*dd = -3/4 * ln(1 - 4/3*p)`

This is what we've been after. These equations relate the actual evolutionary distance to the observed changes and vice-versa.

p = proportion or fraction of sites that are observed to be different
d = distance or actual number of substitutions per site

Their relationship is plotted at the top.

Plot code:

 `from math import eimport numpy as npimport matplotlib.pyplot as pltdef d(at): return 6*atdef p(at): return -0.75*e**(-8*at) + 0.75at = A = np.arange(0,2.001,0.001)plt.plot(at,d(at),lw=3,color='r')plt.plot(at,p(at),lw=3,color='b')ax = plt.axes()ax.set_xlim(0, 1.0)ax.set_ylim(0, 2.0)ax.grid(True)plt.savefig('example.png')`