Tuesday, March 23, 2010

Fitch and Sankoff Algorithms for Parsimony

This example comes from Felsenstein's book Inferring Phylogenies (pp.11-16). Consider a tree with the structure and observed data as shown below, for a particular position in a multiple sequence alignment.

"The Fitch algorithm considers the sites (or characters) one at a time. At each tip in the tree, we create a set containing those nucleotides (states) that are observed or are compatible with the observation. Thus, if we see an A, we create the set {A}. If we see an ambiguity such as R, we create the set {AG}. Now we move down the tree [away from the tips]. In algorithmic terms, we do a postorder tree traversal. At each interior node we create a set that is the intersection of sets at the two descendant nodes. However, if that set is empty, we instead create the set that is the union of the two sets at the descendant nodes. Every time we create such a union, we also count one change of state."




In this example then, the terminal nodes are unambiguous. For N2 and N4 we create the unions N2={AC} and N4={AG}. For N3 we create another union N3={ACG}. For the node N1, we can take the intersection of {AC} and {ACG}, which is N1={C}. We created three unions, so there will be three changes of state in the most parsimonious tree.

Sets do not necessarily contain the ancestral states. For this example, we might assign state C to all of the ancestral nodes. Alternatively, we could assign A to all of them. Both trees would require 3 changes of state. But A is not in the set N1={C}.

The Sankoff algorithm allows the incorporation of varying costs for different nucleotide changes. Felsenstein continues the example with a (symmetric) cost matrix. (I have doubled his costs to keep the example all integers).




A    C    G    T
A   0    5    2    5
C   5    0    5    2
G   2    5    0    5
T   5    2    5    0


For each node we set up a small array of values, corresponding to the states (A C G T) in order---shown in boxes in the figure. For the terminal nodes, we assign cost 0 for the correct state, and infinite cost for the others. Since we're not going to use these others, I will just mark them with a '-'.



Now consider node N2. It might be in any of the four states. For each state, we compute the cost of moving down to its (two) terminal nodes. For example, starting from C there would be a cost of 5 (for the C to A change), and no cost for the top branch, which has no change. The same holds for A. For the other two states there would be two penalties, of which one will be 5 and one (the transition) will be only 2. We complete the calculation with the array (5 5 7 7).

For node N4, the calculation is different because the terminal nodes are both purines. Thus, for an initial C or T, there is a total cost of 10, and for an initial A or G, there is a total cost of 2.

For N3, which also has a terminal node as direct descendant, there is again a small calculation to do. It is slightly trickier than before. If the state of N3 is A, then the cost to change to C in the terminal node is 5 and there is a further cost of 2 which will be exacted while going from node N4=A to the terminal G, which we read from the array of values at N4. The first entry at N3 is then 7. A similar argument holds for G at N3 by symmetry.

On the other hand, if the state at N3 is C, then there is no cost for the terminal node, but we read the corresponding value 10 from the array at N4. This is the cost to stay C at N4. However, we can do better than that. Suppose we change from C to A at N4 (cost 5). Then there is only an additional cost of 2 for the change from A to G, for a total of 7.

The last possibilty at N3 is to have T. In that worst-case scenario we have a cost of 2 for C -> T, 5 for C -> R at N4, and then a further cost of 2 to reach the other purine. The total is 9.

For nodes which do not have any terminal nodes as direct descendants, we simply add the values. So the array for N1 is the sum of the arrays for its descendants (12 12 14 16). This makes it easy to compute the arrays for most interior nodes in a large tree.

If you read the first part of this post from before, you will recognize this as a dynamic programming algorithm. Felsenstein explains that the two algorithms are closely connected, although he resists the temptation to say that Fitch is a "special case" of Sankoff. But he explains that why Fitch works is easy to understand from a consideration of Sankoff.

Read the book to find out more. This text is the standard reference for the field. Felsenstein is almost always very clear, although I have to say that he is occasionally too encyclopedic for my interests. Still, it is highly recommended).




[UPDATE:  Seven years later, I have received a comment from a reader who correctly points out an error in this post.  Let's see if I can explain what the problem is and how to fix it.  

The issue first shows up in evaluating node N3.  As before, we write the costs for the states in the order A C G T.  The cost matrix is 

0 for no change
2 for transitions
5 for transversions (U <-> Y)

The naive approach is as follows.  Suppose N3 = A, then the cost to change to C in the terminal node is 5 and *if N4 also has A*, the cost on that branch is the value in the array at N4(A) = 2, giving a total cost of 5 + 2 = 7, which is the first value shown for N3.  

But this will only be correct if it is the *minimum* cost for N3 = A.  

We must also evaluate the costs for having N3 = A and N4 be something other than A.

A -> A = 0 + 2 = 2
A -> C = 5 + 10 = 15
A -> G = 2 + 2 = 4
A -> T = 5 + 10 = 15

(Here the first value is for the change from N3 -> N4 and the second comes from the N4 array).

Since A -> A is the minimum score we add 2 to the 5 for A -> C at the terminal node to obtain the final value for A at N3 of 7.

We continue in this manner for each array position:

C -> A = 5 + 2 = 7
C -> C = 0 + 10 = 10
C -> G = 5 + 2 = 7
C -> T = 2 + 10 = 12

Pick the minimum as the value for C at N3.  Here we see that the cost is lower if we switch to a purine at N4.  There is no cost for C -> C at the terminal node. 

G -> A = 2 + 2 = 4
G -> C = 5 + 10 = 15
G -> G = 0 + 2 = 2
G -> T = 5 + 10 = 15

The minimum has G at N3.  To the 2 we add 5 for A -> C at the terminal node to obtain the final value for A at N3.

T -> A = 5 + 2 = 7
T -> C = 2 + 10 = 12
T -> G = 5 + 2 = 7
T -> T = 0 + 10 = 10

The value for T at N3 should be 12 (counting the cost for T -> C in the terminal node).  The figure is in error, as is the text, because I was writing C -> T when I should have written T -> C.

Now we come to the root, N1.  Again, we need to find the minimum cost.  We evaluate N2 and N3 separately.  For N1 -> N2 we have:

A -> A = 0 + 5 = 5
A -> C = 5 + 5 = 10
A -> G = 2 + 7 = 9
A -> T = 5 + 7 = 12

For N1 -> N3 we have

A -> A = 0 + 7 = 7
A -> C = 5 + 7 = 12
A -> G = 2 + 7 = 9
A -> T = 5 + 12 = 17

So the minimum cost for A at N1 is 12, which occurs when N2 and N3 are both A as well.

C -> A = 5 + 5 = 10
C -> C = 0 + 5 = 5
C -> G = 5 + 7 = 12
C -> T = 2 + 7 = 9

C -> A = 5 + 7 = 12
C -> C = 0 + 7 = 7
C -> G = 5 + 7 = 12
C -> T = 2 + 12 = 14

The minimum cost for C at N1 is 12.

G -> A = 2 + 5 = 7
G -> C = 5 + 5 = 10
G -> G = 0 + 7 = 7
G -> T = 5 + 7 = 12

G -> A = 2 + 7 = 9
G -> C = 5 + 7 = 12
G -> G = 0 + 7 = 7
G -> T = 5 + 12 = 17

The minimum cost for G at N1 is 14.

T -> A = 5 + 5 = 10
T -> C = 2 + 5 = 7
T -> G = 5 + 7 = 12
T -> T = 0 + 7 = 7

T -> A = 5 + 7 = 12
T -> C = 2 + 7 = 9
T -> G = 5 + 7 = 12
T -> T = 0 + 12 = 12

The minimum cost for T at N1 is 16.

If you're keeping score at home, you will notice that this is the same result we obtained by simply adding the values from the arrays for N3 and N4.

The counter-example from above had us switching to a purine at N4 to overcome the substantial costs of Y -> A,G = 10 at the terminal nodes.

What it looks like to me after consulting other sources is that, most of the time, simply adding the values from the two arrays works.  However, carrying out the full minimization procedure is the correct approach.