Python for Bioinformatics: Models of sequence evolution 1

I've decided to work on Yang's Computational Molecular Evolution again. It's not an elementary book, but I like his style and clarity of explanation a lot. The goal is to understand models of sequence evolution, and more than that, to see how starting with the substitution-rate matrix Q, we can get to the transition probability matrix P. Once we have P, we can crank it as many times as we need to.

Q is a 4 x 4 matrix with the instantaneous rates for each nucleotide to change to another, or stay the same.

Everybody has a different order for the nucleotides! I usually use ACGT (because then I don't have to think twice about it), wikipedia uses AGCT some places and something else other places. Yang uses TCAG, and we'll stick with that since we're borrowing his whole exposition.

Usually, the columns of Q represent the instantaneous rate of going to a particular state, so row 1 is T => T (TT), TC TA and TG. There are exceptions (wikipedia some places, Dayhoff), but this is pretty standard.

I'm going to start with a more advanced model, called TN93 (after Tamura & Nei 1993 PMID 8336541). For TN93 we have 3 exchangeability parameters (α₁, α₂, β): these describe the rate at which either of the two transitions, or all transversions, occur.

α₁ TC
α₂ AG
β transversions

The model is time-reversible, so the exchangeability parameter for MN applies also to NM.

The equilibrium nucleotide frequencies are π_N. These are the frequencies to which a sequence would tend by many sequential applications of P. So here is Q. Notice the symmetry:


Q =     -   α₁π_C    βπ_A   βπ_G
     α₁π_T    -     βπ_A   βπ_G
      βπ_T   βπ_C     -   α₂π_G
      βπ_T   βπ_C   α₂π_A    -

For example, the instantaneous rate of change from T to C is Q₁₂, which is α₁π_C. And that's counterintuitive: it involves π_C, even though that's where we're going.

There are two other rules. First, the nucleotide frequencies must sum to one: Σ π_i = 1. Also the sum of rates in each row equals zero, so we calculate the diagonals accordingly.

Now, how do you get from Q to P? Here is the equation, which looks simple but definitely is not:


P(t) = e^Qt

But we can avoid using it for the moment. Yang gives an analytical solution (i.e. which is explicit in terms of the above parameters) for TN93. That's probably why he chose this example, because it's complex enough to be realistic but also has this kind of solution.

The eigenvalues (remember those?) are:


λ₁ =  0
λ₂ = -β
λ₃ = -(π_Rα₂ + π_Yβ)
λ₄ = -(π_Yα₁ + π_Rβ)

He defines e values:


e₂ = exp(λ₂t) = exp(-βt)
e₃ = exp(λ₃t) = exp[-(π_Rα₂ + π_Yβ)t]
e₄ = exp(λ₄t) = exp[-(π_Yα₁ + π_Rβ)t]

Then come the 16 terms for the P matrix..


row 1:
π_T + (π_Tπ_R/π_Y)e₂ + (π_C/π_Y)e₄
π_C + (π_Cπ_R/π_Y)e₂ - (π_C/π_Y)e₄
π_A(1 - e₂)
π_G(1 - e₂)
row 2:
π_T + (π_Tπ_R/π_Y)e₂ - (π_T/π_Y)e₄
π_C + (π_Cπ_R/π_Y)e₂ + (π_T/π_Y)e₄
π_A(1 - e₂)
π_G(1 - e₂)
row 3:
π_T(1 - e₂)
π_C(1 - e₂)
π_A + (π_Aπ_Y/π_R)e₂ + (π_G/π_R)e₃
π_G + (π_Gπ_Y/π_R)e₂ - (π_G/π_R)e₃
row 4:
π_T(1 - e₂)
π_C(1 - e₂)
π_A + (π_Aπ_Y/π_R)e₂ - (π_A/π_R)e₃
π_G + (π_Gπ_Y/π_R)e₂ + (π_A/π_R)e₃

Naturally, I wrote a simulation in Python. It chooses α₁, α₂, and β = 2, 1, 0.2. I have no idea yet if those are reasonable values.

But what you can see from the output is that, if we crank the P matrix a very long time, the nucleotide frequencies approach the values in the π vector, as they should.

P
[[  9.93036567e-01   5.96443233e-03   3.99600267e-04   5.99400400e-04]
[  3.97628822e-03   9.95024711e-01   3.99600267e-04   5.99400400e-04]
[  3.99600267e-04   5.99400400e-04   9.96011178e-01   2.98982117e-03]
[  3.99600267e-04   5.99400400e-04   1.99321411e-03   9.97007785e-01]]

P * P
[[  9.86145739e-01   1.18582558e-02   7.98402131e-04   1.19760320e-03]
[  7.90550385e-03   9.90098491e-01   7.98402131e-04   1.19760320e-03]
[  7.98402131e-04   1.19760320e-03   9.92044626e-01   5.95936909e-03]
[  7.98402131e-04   1.19760320e-03   3.97291272e-03   9.94031082e-01]]

2500 rounds later:
[[ 0.20134221  0.30201331  0.19865779  0.29798669]
[ 0.20134221  0.30201331  0.19865779  0.29798669]
[ 0.19865779  0.29798669  0.20134239  0.30201313]
[ 0.19865779  0.29798669  0.20134209  0.30201344]]

from __future__ import division
import math
import numpy as np
from numpy import dot

# order TCAG
pi = [0.2,0.3,0.2,0.3]
T,C,A,G = pi
Y = T + C
R = A + G

a1 = 2   # Y transitions
a2 = 1   # R transitions
b = 0.2    # transversions

t = 0.01
e2 = math.e**(-b*t)
e3 = math.e**(-(R*a2 + Y*b)*t)
e4 = math.e**(-(Y*a1 + R*b)*t)

TY,CY,AR,GR = T/Y,C/Y,A/R,G/R
M = [
    T + TY*R*e2 + CY*e4, C + CY*R*e2 - CY*e4,
    A*(1 - e2), G*(1 - e2),
    T + TY*R*e2 - TY*e4, C + CY*R*e2 + TY*e4,
    A*(1 - e2), G*(1 - e2),
    T*(1 - e2), C*(1 - e2),
    A + AR*Y*e2 + GR*e3, G + GR*Y*e2 - GR*e3,
    T*(1 - e2), C*(1 - e2),
    A + AR*Y*e2 - AR*e3, G + GR*Y*e2 + AR*e3  ]
   
P = np.array(M)
P.shape = (4,4)
print 'P', '\n', P, '\n'
M = dot(P,P)
print 'P * P', '\n', M, '\n'
print '2500 rounds later:'
for i in range(2500):
   M = dot(P,M)
print M

Tuesday, December 7, 2010

Models of sequence evolution 1