Saturday, November 20, 2010

Likelihood of an evolutionary tree

Calculating the likelihood of a particular phylogenetic tree sounds complicated and it's easy to get lost in the details, but is really just a matter of bookkeeping. That's why we have computers! This example is taken from my all-time favorite Bioinformatics textbook (Higgs & Attwood).

Let's work with the tree shown above. We're considering one particular site in a sequence alignment from a set of OTUs (operational taxonomic units), where the observed nucleotides are as shown. The identities of the internal nodes are unknown and each could be any one of the four nucleotides. (I've shown the same tree several times at different places in the post for reference).

Consider first the node labeled X. In general, the branches from X to A and from X to G may be different in length. Suppose the first one is described by t1 and the second by t2.

We have a particular model of sequence evolution for this site, encapsulated in a transition-probability matrix (as discussed here), which allows us to calculate the probabilities for all 4 possibilities with no change:


and all 12 possible changes:


The identity of X isn't known, it might be any of {A,C,G,T}. Starting with A, then we can calculate the probabilities for the two branches as:


The probability of the observed data for both of these two terminal nodes with the model that X = A, is the likelihood that X = A, given the data. Namely:

L(X=A) = PAA(t1) PAG(t2)

In the same way:

L(X=C) = PCA(t1) PCG(t2)
L(X=G) = PGA(t1) PGG(t2)
L(X=T) = PTA(t1) PTG(t2)

Of course, the total probability for X = {A,C,G,T} is the sum:

Σ PNA(t1) PNG(t2)
N = {A,C,G,T}

which has to equal 1, since X must be one of the four nucleotides.

Repeated multiplication leads to various difficulties which are avoided by using logarithms. So typically we would talk about the log likelihood, for example the log of L(X=C). All of the calculations shown below as multiplications would be done in practice by adding the logarithm of each term.

Working up the tree

Now consider node Y and its branch to X, with a branch length described by t3.

We can calculate the probabilty for any choice of nucleotides (say, Y = A, X = C) in the same way as before. For this choice it is:

PAC(t3) PCA(t1) PCG(t2)

It can also be written using the expression for likelihood L(X=C) as:

PAC(t3) L(X=C)

Stick with probabilities for a moment.

There are 16 possible paths here. Suppose Y = M and X = N, where both M and N are in the set {A,C,G,T}. Then there are 16 terms like:

PMN(t3) PNA(t1) PNG(t2)

This is only for the left-hand clade. We usually talk about the tree as if it were rotated with Z at the top and Y on the left, but it's easier to draw as shown. The probability for the right-hand branch from Y to G described by t4, is calculated for any particular choice of M = {A,C,G,T} as described in the first part:


So each of our 16 terms picks up an additional factor

PMG(t4) PMN(t3) PNA(t1) PNG(t2)

and the sum of the four terms involving Y=M (for a particular choice of M = A,C,G or T) is the likelihood L(Y=M).

A node with two internal child nodes

Now consider node W at the top of the tree and the branch t5 connecting it to Y. We let W assume any of the four states {A,C,G,T} and calculate probabilities as follows:

P(W=A,Y=A) = PAA(t5)
P(W=A,Y=C) = PAC(t5)

For the t5 branch, we have three equivalent expressions:

P(W=A,Y=A) L(Y=A)
P(W=A,Y=A) P(Y=A,X=A) L(X=A)
P(W=A,Y=A) P(Y=A,X=A) P(X=A,left=A) P(X=A,right=G)

There are 16 terms like:

P(W=A,Y=A) L(Y=A)

for each of the 16 possible values for the tuple (W,Y). Once again, this is only for the left-hand clade. Each of the 16 terms picks up another factor from the right-hand branch:

P(W=A,Y=A) L(Y=A) P(W=A,Z=A) L(Z=A)

So the total likelihood L(W=A,Y={A,C,G,T},Z={A,C,G,T}) is the sum of:

P(W=A,Y=A) L(Y=A) P(W=A,Z=A) L(Z=A)
P(W=A,Y=C) L(Y=C) P(W=A,Z=A) L(Z=A)
P(W=A,Y=G) L(Y=G) P(W=A,Z=A) L(Z=A)
P(W=A,Y=T) L(Y=T) P(W=A,Z=A) L(Z=A)

P(W=A,Y=A) L(Y=A) P(W=A,Z=C) L(Z=C)
P(W=A,Y=C) L(Y=C) P(W=A,Z=C) L(Z=C)
P(W=A,Y=G) L(Y=G) P(W=A,Z=C) L(Z=C)
P(W=A,Y=T) L(Y=T) P(W=A,Z=C) L(Z=C)

P(W=A,Y=A) L(Y=A) P(W=A,Z=G) L(Z=G)
P(W=A,Y=C) L(Y=C) P(W=A,Z=G) L(Z=G)
P(W=A,Y=G) L(Y=G) P(W=A,Z=G) L(Z=G)
P(W=A,Y=T) L(Y=T) P(W=A,Z=G) L(Z=G)

P(W=A,Y=A) L(Y=A) P(W=A,Z=T) L(Z=T)
P(W=A,Y=C) L(Y=C) P(W=A,Z=T) L(Z=T)
P(W=A,Y=G) L(Y=G) P(W=A,Z=T) L(Z=T)
P(W=A,Y=T) L(Y=T) P(W=A,Z=T) L(Z=T)

An equivalent description is to call this a sum of sums, or double sum:

Σ  Σ P(W=A,Y=M) L(Y=M) P(W=A,Z=N) L(Z=N)
M = {A,C,G,T}, N = {A,C,G,T}

The expression is symmetrical and you could use either M or N in the outside sum. But the double sum makes my head spin a bit. So I'll stick with the longer (and perhaps less confusing) formula.

We can abbreviate this as L(W=A).

Top of the tree

At the top of the tree, there is a final factor which comes in. It is π, the equilibrium frequency of each nucleotide according to our evolutionary model. The total likelihood for the tree is the sum over the four nucleotides:

Σ  πN L(W=N)

An alignment of many sites

As Higgs and Attwood say:

The likelihood is calculated for each site in this way, and it is assumed that sites evolve independently. Therefore the likelihood for the complete sequence set, Ltot, is the product of the likelihoods of the sites. It therefore follows that the log of the likelihood is the sum of the logs of the site likelihoods.

log Ltot = Σ {sites} log Lsite

Like I said, it's just a matter of bookkeeping.