Python for Bioinformatics: Tree representation and one kind of traversal

Friday, November 26, 2010

Tree representation and one kind of traversal

Shown above is the tree we generated using the neighbor-joining method (here and here). I've annotated the plot from ape to label the internal nodes and show all the branch lengths. The internal node labels were assigned in the same order as they were clustered to build the tree. I didn't mention it before, but it's important to remember that this is an unrooted tree, and that's why it's drawn as it is, rather than with all the branches laid out horizontally, as in the UPGMA example (here).

I want to spend some time on trees in general, without worrying about how they were built. One goal is to understand the different modes of tree traversal (preorder and postorder and so on). Another is to explore different internal representations for trees in Python code, and conversions to and back from Newick format (and possibly XML format). In addition, I'd like to understand what it takes (in code) to "root" a tree, and how to implement various methods for tree "surgery" like "tree pruning and re-grafting" (e.g. here), and what the implications are for the Newick structure, which I find a bit hard to grasp.

In today's (very simple) example we're just going to try to visit all the nodes. The first thing is to decide on a way to represent the tree. The basic structure is naturally a dictionary, because we want to be able to "look up" the data for any individual node quickly. In a simplifying (and liberating) decision we're going to use a simple Python list to hold that data. One advantage of this is that we could accommodate star topologies easily. (And we could remember distance information by using a tuple of two lists, the names in one and distances in the second). Although a string would also do here, in general it won't work because we want to allow labels with len(label) > 1.

Only the internal nodes are in the dict; it looks like this:

tree_dict = { '0':['A','B','1'], '1':['C','0','3'],
              '2':['E','D','3'], '3':['F','1','2'] }

At this point, we have no notion of directionality. That's the liberating part. We will gain more structure as we traverse the tree.

Code to visit all the nodes naturally involves recursion, a function which processes nodes and calls itself for each of the child nodes, but returns if the current node is a tip (external). The first script (below) gives the following output, the node where we start the traversal is listed first:

['1', 'C', '0', 'A', 'B', '3', 'F', '2', 'E', 'D']
['0', 'A', 'B', '1', 'C', '3', 'F', '2', 'E', 'D']
['3', 'F', '1', 'C', '0', 'A', 'B', '2', 'E', 'D']
['2', 'E', 'D', '3', 'F', '1', 'C', '0', 'A', 'B']

There's one simple modification we can make to allow us to actually draw the tree. That is to record the relationship between two nodes as we descend into the tree. The second script is modified to do that. It prints:

A:0, C:1, B:0, E:2, D:2, F:3, 1:root, 0:1, 3:1, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:0, 0:root, 3:1, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:root, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:2, 2:root

Here, each value is a pair where the first name is the node, and the second is its parent. The "root" node is identified by using that label in place of a parent. Now we will be able to draw the tree, but actually doing that will involve some complexities which are better dealt with in a separate post.

utils contains flatten from here.

code listings:

import utils
tree_dict = { '0':['A','B','1'],
              '1':['C','0','3'],
              '2':['E','D','3'],
              '3':['F','1','2'] }
              
all = list(utils.flatten(tree_dict.values()))
external = [e for e in all if not e in tree_dict]

def descend(node,seen):
    seen.append(node)
    if node in tree_dict:
        for child in tree_dict[node]:
            if not child in seen:
                descend(child,seen)

def traverse_tree(root):
    seen = list()
    descend(i_node,seen)
    print seen    

for i_node in tree_dict:
    traverse_tree(i_node)

import utils
tree_dict = { '0':['A','B','1'],
              '1':['C','0','3'],
              '2':['E','D','3'],
              '3':['F','1','2'] }
              
all = list(utils.flatten(tree_dict.values()))
external = [e for e in all if not e in tree_dict]

def descend(node,caller,pD):
    pD[node] = caller
    if node in tree_dict:
        for child in tree_dict[node]:
            if not child in pD:
                descend(child,node,pD)

def traverse_tree(root):
    pD = dict()
    descend(root,'root',pD)
    # might filter for k in tree_dict
    L = [k + ':' + pD[k] for k in pD]
    print ', '.join(L)

for i_node in tree_dict:
    traverse_tree(root=i_node)