Tuesday, March 23, 2010

Fitch and Sankoff Algorithms for Parsimony

This example comes from Felsenstein's book Inferring Phylogenies (pp.11-16). Consider a tree with the structure and observed data as shown below, for a particular position in a multiple sequence alignment.

"The Fitch algorithm considers the sites (or characters) one at a time. At each tip in the tree, we create a set containing those nucleotides (states) that are observed or are compatible with the observation. Thus, if we see an A, we create the set {A}. If we see an ambiguity such as R, we create the set {AG}. Now we move down the tree [away from the tips]. In algorithmic terms, we do a postorder tree traversal. At each interior node we create a set that is the intersection of sets at the two descendant nodes. However, if that set is empty, we instead create the set that is the union of the two sets at the descendant nodes. Every time we create such a union, we also count one change of state."




In this example then, the terminal nodes are unambiguous. For N2 and N4 we create the unions N2={AC} and N4={AG}. For N3 we create another union N3={ACG}. For the node N1, we can take the intersection of {AC} and {ACG}, which is N1={C}. We created three unions, so there will be three changes of state in the most parsimonious tree.

Sets do not necessarily contain the ancestral states. For this example, we might assign state C to all of the ancestral nodes. Alternatively, we could assign A to all of them. Both trees would require 3 changes of state. But A is not in the set N1={C}.

The Sankoff algorithm allows the incorporation of varying costs for different nucleotide changes. Felsenstein continues the example with a (symmetric) cost matrix. (I have doubled his costs to keep the example all integers).




A    C    G    T
A   0    5    2    5
C   5    0    5    2
G   2    5    0    5
T   5    2    5    0


For each node we set up a small array of values, corresponding to the states (A C G T) in order---shown in boxes in the figure. For the terminal nodes, we assign cost 0 for the correct state, and infinite cost for the others. Since we're not going to use these others, I will just mark them with a '-'.



Now consider node N2. It might be in any of the four states. For each state, we compute the cost of moving down to its (two) terminal nodes. For example, starting from C there would be a cost of 5 (for the C to A change), and no cost for the top branch, which has no change. The same holds for A. For the other two states there would be two penalties, of which one will be 5 and one (the transition) will be only 2. We complete the calculation with the array (5 5 7 7).

For node N4, the calculation is different because the terminal nodes are both purines. Thus, for an initial C or T, there is a total cost of 10, and for an initial A or G, there is a total cost of 2.

For N3, which also has a terminal node as direct descendant, there is again a small calculation to do. It is slightly trickier than before. If the state of N3 is A, then the cost to change to C in the terminal node is 5 and there is a further cost of 2 which will be exacted while going from node N4=A to the terminal G, which we read from the array of values at N4. The first entry at N3 is then 7. A similar argument holds for G at N3 by symmetry.

On the other hand, if the state at N3 is C, then there is no cost for the terminal node, but we read the corresponding value 10 from the array at N4. This is the cost to stay C at N4. However, we can do better than that. Suppose we change from C to A at N4 (cost 5). Then there is only an additional cost of 2 for the change from A to G, for a total of 7.

The last possibilty at N3 is to have T. In that worst-case scenario we have a cost of 2 for C -> T, 5 for C -> R at N4, and then a further cost of 2 to reach the other purine. The total is 9.

For nodes which do not have any terminal nodes as direct descendants, we simply add the values. So the array for N1 is the sum of the arrays for its descendants (12 12 14 16). This makes it easy to compute the arrays for most interior nodes in a large tree.

If you read the first part of this post from before, you will recognize this as a dynamic programming algorithm. Felsenstein explains that the two algorithms are closely connected, although he resists the temptation to say that Fitch is a "special case" of Sankoff. But he explains that why Fitch works is easy to understand from a consideration of Sankoff.

Read the book to find out more. This text is the standard reference for the field. Felsenstein is almost always very clear, although I have to say that he is occasionally too encyclopedic for my interests. Still, it is highly recommended).




[UPDATE:  Seven years later, I have received a comment from a reader who correctly points out an error in this post.  Let's see if I can explain what the problem is and how to fix it.  

The issue first shows up in evaluating node N3.  As before, we write the costs for the states in the order A C G T.  The cost matrix is 

0 for no change
2 for transitions
5 for transversions (U <-> Y)

The naive approach is as follows.  Suppose N3 = A, then the cost to change to C in the terminal node is 5 and *if N4 also has A*, the cost on that branch is the value in the array at N4(A) = 2, giving a total cost of 5 + 2 = 7, which is the first value shown for N3.  

But this will only be correct if it is the *minimum* cost for N3 = A.  

We must also evaluate the costs for having N3 = A and N4 be something other than A.

A -> A = 0 + 2 = 2
A -> C = 5 + 10 = 15
A -> G = 2 + 2 = 4
A -> T = 5 + 10 = 15

(Here the first value is for the change from N3 -> N4 and the second comes from the N4 array).

Since A -> A is the minimum score we add 2 to the 5 for A -> C at the terminal node to obtain the final value for A at N3 of 7.

We continue in this manner for each array position:

C -> A = 5 + 2 = 7
C -> C = 0 + 10 = 10
C -> G = 5 + 2 = 7
C -> T = 2 + 10 = 12

Pick the minimum as the value for C at N3.  Here we see that the cost is lower if we switch to a purine at N4.  There is no cost for C -> C at the terminal node. 

G -> A = 2 + 2 = 4
G -> C = 5 + 10 = 15
G -> G = 0 + 2 = 2
G -> T = 5 + 10 = 15

The minimum has G at N3.  To the 2 we add 5 for A -> C at the terminal node to obtain the final value for A at N3.

T -> A = 5 + 2 = 7
T -> C = 2 + 10 = 12
T -> G = 5 + 2 = 7
T -> T = 0 + 10 = 10

The value for T at N3 should be 12 (counting the cost for T -> C in the terminal node).  The figure is in error, as is the text, because I was writing C -> T when I should have written T -> C.

Now we come to the root, N1.  Again, we need to find the minimum cost.  We evaluate N2 and N3 separately.  For N1 -> N2 we have:

A -> A = 0 + 5 = 5
A -> C = 5 + 5 = 10
A -> G = 2 + 7 = 9
A -> T = 5 + 7 = 12

For N1 -> N3 we have

A -> A = 0 + 7 = 7
A -> C = 5 + 7 = 12
A -> G = 2 + 7 = 9
A -> T = 5 + 12 = 17

So the minimum cost for A at N1 is 12, which occurs when N2 and N3 are both A as well.

C -> A = 5 + 5 = 10
C -> C = 0 + 5 = 5
C -> G = 5 + 7 = 12
C -> T = 2 + 7 = 9

C -> A = 5 + 7 = 12
C -> C = 0 + 7 = 7
C -> G = 5 + 7 = 12
C -> T = 2 + 12 = 14

The minimum cost for C at N1 is 12.

G -> A = 2 + 5 = 7
G -> C = 5 + 5 = 10
G -> G = 0 + 7 = 7
G -> T = 5 + 7 = 12

G -> A = 2 + 7 = 9
G -> C = 5 + 7 = 12
G -> G = 0 + 7 = 7
G -> T = 5 + 12 = 17

The minimum cost for G at N1 is 14.

T -> A = 5 + 5 = 10
T -> C = 2 + 5 = 7
T -> G = 5 + 7 = 12
T -> T = 0 + 7 = 7

T -> A = 5 + 7 = 12
T -> C = 2 + 7 = 9
T -> G = 5 + 7 = 12
T -> T = 0 + 12 = 12

The minimum cost for T at N1 is 16.

If you're keeping score at home, you will notice that this is the same result we obtained by simply adding the values from the arrays for N3 and N4.

The counter-example from above had us switching to a purine at N4 to overcome the substantial costs of Y -> A,G = 10 at the terminal nodes.

What it looks like to me after consulting other sources is that, most of the time, simply adding the values from the two arrays works.  However, carrying out the full minimization procedure is the correct approach.

7 comments:

Joe Felsenstein said...

Your explanations are very clear (and correct).

Thanks for the kind words about my book. I wish we had a counterpart that was at a less encyclopedic level. Page and Holmes was nice (as you said in one of your other posts) but alas they will not revise it (at least with that publisher) because of their unhappiness with their publisher.

telliott99 said...

I see a small error in the figure on the right-hand margin. The little boxes are supposed to represent costs in order ACGT. So the box next to C is correct, but the one next to A should have 0--- and the one next to G, --0-. Since we don't actually use the values, it didn't cause a problem.

Garry Jolley-Rogers said...

Very good useful too.
I am a little confused shouldn't the union of {AC} and {ACG} be {AC} not {C}. Is there a missing step?

telliott99 said...

@Garry_Jolley-Rogers: We want the intersection, if possible.

mrbrklyn said...

This explanation is NOT clear, and I'm not sure it is even correct. Felsenstein's text is even LESS clear. So you improved on his explanation, but it is still not correct.

The last step, one does not just add up the arrays. It is recursive and needs to do the same minimal search as the previous steps. The text just slopply and lazily skips over this. Your example is better until you get to the last step.

The key thing is that at each level we are looking for the minimal possible cost.

On the last node, for example, we start by looking at the first position in the array, which represents "a". The minimal value happens to be the first of the left child node, but it doesn't have to be that way. It represents a->a has a 0 cost, and carry over the 2.5. If we do a->c or a->g or a->t, none come out LESS. So we accept a->a at and the 2.5 value when evaluating the left side.

Now, evaluating the right, again a->a is the minimal value (but it doesn't have to be).

You have 0 cost on the a->a transition and 3.5 on the carry over. Total for both sides is 6 (or 12 in the case of your costs which are doubled the text).


I would like to see the code you worked up for this. It is probably more correct than the explanation.

telliott99 said...

To the previous commenter: you could work on your attitude some. But thank you for pointing out the error.

23a6c266-1387-11e4-9fd7-9b46c275da1e said...

Sorry if the ton seemed difficult. I had spent considerable time trying to tease the algorithm from texts and examples. I did write up a complete example of the allgorithm in C++, the only one I know of openly published, as part of my Thesis which is available http://www.nylxs.com/docs/thesis/sources/thesis.pdf and finch is http://www.nylxs.com/docs/thesis/src/fitch/

Thanks - Ruben Safir