Python for Bioinformatics: Jukes-Cantor (3)

Friday, February 26, 2010

Jukes-Cantor (3)

From last time, we have two equations for sequences changing according to Jukes-Cantor:

P_XX(t) = 1/4 + 3/4*e^-4*α*t
P_XY(t) = 1/4 - 1/4*e^-4*α*t

We can look at this as the probability that a single site will change in this way (from X to Y) over time, but we can also look at it as the fraction of a collection of sites that will change. Since Y can be any one of three nucleotides, the total fraction of sites that differ between the ancestral sequence and a present-day descendant sequence is three times P_XY(t) or:

P_XN(t) = 3/4 - 3/4*e^-4*α*t

Two present-day homologs (common ancestor) have effectively evolved for twice the time because there are two stretches of evolution of time t. The proportion of sites that differ is:

p = 3/4 - 3/4*e^-8*α*t

The above equation is what we observe when we look at the sequences. However, our estimate of the true distance, or actual number of substitutions per site:

d = 2*t * 3*α = 6*α*t

We usually do not know either α or t individually, but we can say that:

α*t = d/6
p = 3/4 - 3/4*e^-8*α*t
4/3*p = 1 - e^-8/6*d
d = -3/4 * ln(1 - 4/3*p)

This is what we've been after. These equations relate the actual evolutionary distance to the observed changes and vice-versa.

p = proportion or fraction of sites that are observed to be different
d = distance or actual number of substitutions per site

Their relationship is plotted at the top.

Plot code:

from math import e
import numpy as np
import matplotlib.pyplot as plt

def d(at):  
    return 6*at

def p(at):  
    return -0.75*e**(-8*at) + 0.75

at = A = np.arange(0,2.001,0.001)
plt.plot(at,d(at),lw=3,color='r')
plt.plot(at,p(at),lw=3,color='b')

ax = plt.axes()
ax.set_xlim(0, 1.0)
ax.set_ylim(0, 2.0)
ax.grid(True)
plt.savefig('example.png')