Tuesday, December 7, 2010

Models of sequence evolution 1

I've decided to work on Yang's Computational Molecular Evolution again. It's not an elementary book, but I like his style and clarity of explanation a lot. The goal is to understand models of sequence evolution, and more than that, to see how starting with the substitution-rate matrix Q, we can get to the transition probability matrix P. Once we have P, we can crank it as many times as we need to.

Q is a 4 x 4 matrix with the instantaneous rates for each nucleotide to change to another, or stay the same.

Everybody has a different order for the nucleotides! I usually use ACGT (because then I don't have to think twice about it), wikipedia uses AGCT some places and something else other places. Yang uses TCAG, and we'll stick with that since we're borrowing his whole exposition.

Usually, the columns of Q represent the instantaneous rate of going to a particular state, so row 1 is T => T (TT), TC TA and TG. There are exceptions (wikipedia some places, Dayhoff), but this is pretty standard.

I'm going to start with a more advanced model, called TN93 (after Tamura & Nei 1993 PMID 8336541). For TN93 we have 3 exchangeability parameters (α1, α2, β): these describe the rate at which either of the two transitions, or all transversions, occur.

α1 TC
α2 AG
β transversions

The model is time-reversible, so the exchangeability parameter for MN applies also to NM.

The equilibrium nucleotide frequencies are πN. These are the frequencies to which a sequence would tend by many sequential applications of P. So here is Q. Notice the symmetry:

Q = - α1πC βπA βπG
α1πT - βπA βπG
βπT βπC - α2πG
βπT βπC α2πA -

For example, the instantaneous rate of change from T to C is Q12, which is α1πC. And that's counterintuitive: it involves πC, even though that's where we're going.

There are two other rules. First, the nucleotide frequencies must sum to one: Σ πi = 1. Also the sum of rates in each row equals zero, so we calculate the diagonals accordingly.

Now, how do you get from Q to P? Here is the equation, which looks simple but definitely is not:

P(t) = eQt

But we can avoid using it for the moment. Yang gives an analytical solution (i.e. which is explicit in terms of the above parameters) for TN93. That's probably why he chose this example, because it's complex enough to be realistic but also has this kind of solution.

The eigenvalues (remember those?) are:

λ1 = 0
λ2 = -β
λ3 = -(πRα2 + πYβ)
λ4 = -(πYα1 + πRβ)

He defines e values:

e2 = exp(λ2t) = exp(-βt)
e3 = exp(λ3t) = exp[-(πRα2 + πYβ)t]
e4 = exp(λ4t) = exp[-(πYα1 + πRβ)t]

Then come the 16 terms for the P matrix..

row 1:
πT + (πTπRY)e2 + (πCY)e4
πC + (πCπRY)e2 - (πCY)e4
πA(1 - e2)
πG(1 - e2)
row 2:
πT + (πTπRY)e2 - (πTY)e4
πC + (πCπRY)e2 + (πTY)e4
πA(1 - e2)
πG(1 - e2)
row 3:
πT(1 - e2)
πC(1 - e2)
πA + (πAπYR)e2 + (πGR)e3
πG + (πGπYR)e2 - (πGR)e3
row 4:
πT(1 - e2)
πC(1 - e2)
πA + (πAπYR)e2 - (πAR)e3
πG + (πGπYR)e2 + (πAR)e3

Naturally, I wrote a simulation in Python. It chooses α1, α2, and β = 2, 1, 0.2. I have no idea yet if those are reasonable values.

But what you can see from the output is that, if we crank the P matrix a very long time, the nucleotide frequencies approach the values in the π vector, as they should.

[[ 9.93036567e-01 5.96443233e-03 3.99600267e-04 5.99400400e-04]
[ 3.97628822e-03 9.95024711e-01 3.99600267e-04 5.99400400e-04]
[ 3.99600267e-04 5.99400400e-04 9.96011178e-01 2.98982117e-03]
[ 3.99600267e-04 5.99400400e-04 1.99321411e-03 9.97007785e-01]]

P * P
[[ 9.86145739e-01 1.18582558e-02 7.98402131e-04 1.19760320e-03]
[ 7.90550385e-03 9.90098491e-01 7.98402131e-04 1.19760320e-03]
[ 7.98402131e-04 1.19760320e-03 9.92044626e-01 5.95936909e-03]
[ 7.98402131e-04 1.19760320e-03 3.97291272e-03 9.94031082e-01]]

2500 rounds later:
[[ 0.20134221 0.30201331 0.19865779 0.29798669]
[ 0.20134221 0.30201331 0.19865779 0.29798669]
[ 0.19865779 0.29798669 0.20134239 0.30201313]
[ 0.19865779 0.29798669 0.20134209 0.30201344]]

from __future__ import division
import math
import numpy as np
from numpy import dot

# order TCAG
pi = [0.2,0.3,0.2,0.3]
T,C,A,G = pi
Y = T + C
R = A + G

a1 = 2 # Y transitions
a2 = 1 # R transitions
b = 0.2 # transversions

t = 0.01
e2 = math.e**(-b*t)
e3 = math.e**(-(R*a2 + Y*b)*t)
e4 = math.e**(-(Y*a1 + R*b)*t)

M = [
T + TY*R*e2 + CY*e4, C + CY*R*e2 - CY*e4,
A*(1 - e2), G*(1 - e2),
T + TY*R*e2 - TY*e4, C + CY*R*e2 + TY*e4,
A*(1 - e2), G*(1 - e2),
T*(1 - e2), C*(1 - e2),
A + AR*Y*e2 + GR*e3, G + GR*Y*e2 - GR*e3,
T*(1 - e2), C*(1 - e2),
A + AR*Y*e2 - AR*e3, G + GR*Y*e2 + AR*e3 ]

P = np.array(M)
P.shape = (4,4)
print 'P', '\n', P, '\n'
M = dot(P,P)
print 'P * P', '\n', M, '\n'
print '2500 rounds later:'
for i in range(2500):
M = dot(P,M)
print M