*Computational Molecular Evolution*again. It's not an elementary book, but I like his style and clarity of explanation a lot. The goal is to understand models of sequence evolution, and more than that, to see how starting with the substitution-rate matrix Q, we can get to the transition probability matrix P. Once we have P, we can crank it as many times as we need to.

Q is a 4 x 4 matrix with the instantaneous rates for each nucleotide to change to another, or stay the same.

Everybody has a different order for the nucleotides! I usually use ACGT (because then I don't have to think twice about it), wikipedia uses AGCT some places and something else other places. Yang uses TCAG, and we'll stick with that since we're borrowing his whole exposition.

Usually, the columns of Q represent the instantaneous rate of going

*to*a particular state, so row 1 is T => T (TT), TC TA and TG. There are exceptions (wikipedia some places, Dayhoff), but this is pretty standard.

I'm going to start with a more advanced model, called TN93 (after Tamura & Nei 1993 PMID 8336541). For TN93 we have 3 exchangeability parameters (α

_{1}, α

_{2}, β): these describe the rate at which either of the two transitions, or all transversions, occur.

The model is time-reversible, so the exchangeability parameter for MN applies also to NM.

The equilibrium nucleotide frequencies are π

_{N}. These are the frequencies to which a sequence would tend by many sequential applications of P. So here is Q. Notice the symmetry:

For example, the instantaneous rate of change from T

*to*C is Q

_{12}, which is α

_{1}π

_{C}. And that's counterintuitive: it involves π

_{C}, even though that's where we're

*going*.

There are two other rules. First, the nucleotide frequencies must sum to one:

`Σ π`_{i} = 1

. Also the sum of rates in each row equals zero, so we calculate the diagonals accordingly.Now, how do you get from Q to P? Here is the equation, which looks simple but definitely is not:

But we can avoid using it for the moment. Yang gives an analytical solution (i.e. which is explicit in terms of the above parameters) for TN93. That's probably why he chose this example, because it's complex enough to be realistic but also has this kind of solution.

The eigenvalues (remember those?) are:

He defines e values:

Then come the 16 terms for the P matrix..

Naturally, I wrote a simulation in Python. It chooses

`α`_{1}, α_{2}, and β = 2, 1, 0.2

. I have no idea yet if those are reasonable values.But what you can see from the output is that, if we crank the P matrix a very long time, the nucleotide frequencies approach the values in the π vector, as they should.

## No comments:

Post a Comment