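From last time, we have two equations for sequences changing according to the Jukes-Cantor model:

$$P_{XX}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$$

$$P_{XY}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$$

Here $P_{XX}(t)$ is the probability that a site holding nucleotide X at time 0 still holds X at time t, $P_{XY}(t)$ is the probability that it holds one particular different nucleotide Y, and α is the rate of substitution to each specific other nucleotide.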
We can read $P_{XY}(t)$ as the probability that a single site changes from X to Y over time t, but we can also read it as the fraction of a collection of sites that changes in this way. Since Y can be any one of three nucleotides, the total fraction of sites that differ between the ancestral sequence and a present-day descendant sequence is three times $P_{XY}(t)$, or:

$$3P_{XY}(t) = \frac{3}{4} - \frac{3}{4} e^{-4\alpha t}$$
Two present-day homologs (sequences that share a common ancestor) have effectively evolved for twice the time, because two stretches of evolution, each of duration t, separate them. The proportion of sites that differ is:

$$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$$
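As a quick numerical check of the doubled-time argument, the sketch below composes the two branches explicitly and compares the result with the closed form for time 2t. It assumes the standard Jukes-Cantor transition probabilities, with α as the rate to each specific other nucleotide; the function names are illustrative and not part of these notes.

```python
import math

def p_same(alpha, t):
    # Probability a site still shows its starting nucleotide after time t.
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)

def p_diff(alpha, t):
    # Probability a site shows one particular different nucleotide after time t.
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)

def prop_differ_two_branches(alpha, t):
    # Two descendants agree at a site if both kept the ancestral nucleotide,
    # or both ended up at the same one of the three other nucleotides.
    agree = p_same(alpha, t) ** 2 + 3.0 * p_diff(alpha, t) ** 2
    return 1.0 - agree

def prop_differ_closed_form(alpha, t):
    # Same quantity, treating the pair as one lineage evolving for time 2t.
    return 0.75 * (1.0 - math.exp(-8.0 * alpha * t))

for alpha, t in [(0.01, 1.0), (0.05, 3.0), (0.2, 10.0)]:
    print(alpha, t, prop_differ_two_branches(alpha, t), prop_differ_closed_form(alpha, t))
```

The two columns agree to rounding error, which is what "twice the time" means here: the exponent goes from 4αt to 8αt.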
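The proportion of differing sites is what we observe when we compare the sequences. The true distance, however, is the actual number of substitutions per site. Each lineage accumulates substitutions at a total rate of 3α per site, so over the two lineages:

$$d = 2 \cdot 3\alpha t = 6\alpha t$$

We usually do not know either α or t individually, but the observed proportion depends on them only through the product αt, and therefore only through d:

$$p = \frac{3}{4}\left(1 - e^{-4d/3}\right)$$

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$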
This is what we've been after. These equations relate the actual evolutionary distance to the observed differences, and vice versa.
p = proportion or fraction of sites that are observed to be different
d = distance or actual number of substitutions per site
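The two directions of this relationship can be sketched in a few lines of Python. This is a minimal illustration assuming the standard Jukes-Cantor forms p = (3/4)(1 - e^(-4d/3)) and d = -(3/4) ln(1 - 4p/3); the function names are hypothetical.

```python
import math

def jc_observed_from_distance(d):
    # Expected proportion of differing sites for a true distance d.
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance_from_observed(p):
    # Corrected distance for an observed proportion p.
    # The correction is undefined at or above the saturation value 3/4.
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.01, 0.1, 0.3, 0.5, 0.7):
    print(p, round(jc_distance_from_observed(p), 4))
```

For small p the corrected distance is close to p itself (p = 0.1 gives d of about 0.107), while near saturation the correction grows quickly (p = 0.5 gives d of about 0.824), because multiple substitutions at the same site hide one another.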
Their relationship is plotted at the top.
Plot code:
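A minimal sketch of how such a plot could be produced, assuming matplotlib is installed. The axis orientation (d on the horizontal axis, p on the vertical), the plotted range, the styling, and the output file name are assumptions and may not match the original figure.

```python
import math

def jc_curve(d_max=3.0, steps=300):
    # Sample p = (3/4)(1 - exp(-4d/3)) on an even grid of d values.
    ds = [d_max * i / steps for i in range(steps + 1)]
    ps = [0.75 * (1.0 - math.exp(-4.0 * d / 3.0)) for d in ds]
    return ds, ps

def plot_jc_curve(path="jc_curve.png"):
    # matplotlib is imported here so the curve data can be used without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    ds, ps = jc_curve()
    fig, ax = plt.subplots()
    ax.plot(ds, ps, label="Jukes-Cantor")
    ax.plot([0.0, 0.75], [0.0, 0.75], linestyle="--", label="p = d (no correction)")
    ax.axhline(0.75, linestyle=":", label="saturation, p = 3/4")
    ax.set_xlabel("d (actual substitutions per site)")
    ax.set_ylabel("p (observed proportion of differing sites)")
    ax.legend()
    fig.savefig(path)

if __name__ == "__main__":
    plot_jc_curve()
```

The dashed line shows where the curve would lie if every substitution were visible; the gap between it and the solid curve is the amount by which the observed proportion underestimates the true distance.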