Saturday, November 27, 2010

Neighbor-joining in Python: doing the plot

As discussed last time, I'm using Python lists to hold the information for each node on a phylogenetic tree. This seems as if it will have significant advantages, especially when it comes to tree "surgery." I'll leave that for another post. What I want to do today is actually draw a tree rooted at each of the internal nodes. Here is our example:

 `example_tree = { '0':['A','B','1'], '1':['C','0','3'], '2':['E','D','3'], '3':['F','1','2'] }`

Code to traverse the tree was shown last time (here). I've modified it slightly to use a dictionary. Previously we had this representation for each node and its parent (in the traversal from a particular root):

 `A:0, C:1, B:0, E:2, D:2, F:3, 1:root, 0:1, 3:1, 2:3`

Now these keys and values are in a dict. The results of the traversals from each internal node as root are:

 `A:0, C:1, B:0, E:2, D:2, F:3, 1:1, 0:1, 3:1, 2:3A:0, C:1, B:0, E:2, D:2, F:3, 1:0, 0:0, 3:1, 2:3A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:3, 2:3A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:2, 2:2`

The root node is now marked by having itself as its "parent."

The drawing code works by building a `repr` (representation) for each node from the farthest tips working up to the root. The method was inspired by PyCogent's `ancestors()` function, which returns a list of ancestors (naturally enough). The output with debug enabled looks like this:

 `ancestorsE ['2', '3', '1', '0']D ['2', '3', '1', '0']F ['3', '1', '0']2 ['3', '1', '0']C ['1', '0']3 ['1', '0']A ['0']B ['0']1 ['0']0 []`

I simply sort on the length (largest first) and work in that order. So E's `repr` is just 'E'. When we process the first i_node (2), it's `repr` will be `(E,D)2`. If we're using branch lengths, it will be: `(E:2.25,D:2.75)2`. The distance data is in a separate dictionary:

 `dist = { ('A','0'):1.0, ('B','0'):4.0, ('C','1'):2.0, ('D','2'):2.75, ('E','2'):2.25, ('F','3'):4.75, ('0','1'):1.0, ('1','3'):1.25, ('2','3'):0.75 }`

Here's the resulting Newick tree:

 `(A:1.00,B:4.00,(C:2.00,(F:4.75,(E:2.25,D:2.75)2:0.75)3:1.25)1:1.00)0;`

One of the plots is shown at the top, a second one is below. Although I had it working before, be advised there's a bug somewhere that blows up R when I try to plot the tree as 'unrooted.' So now my to_do list includes the challenge of fixing that. [UPDATE: It seems to only happen when calling R from Python using RPy---probably related to the use of a Python keyword type as an argument to the R function. ] Zipped project files are here. And a final note, I've included the upgma code from before. However, we processed the nodes with a different algorithm there, so the drawing code is different, and I haven't tried to convert it to the new method yet.