Tuesday, May 6, 2008

Home-grown plotter

For various reasons, which are probably not sufficient to justify the time I've invested, I decided to try to plot phylogenetic trees myself. One reason is that I could not find a program with a GUI for OS X. Another is that programming small examples is the best way I know to learn bioinformatics.

The Newick format is commonly used for trees. For example: (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

Some examples have the internal nodes named, but the default output I started with from R's 'ape' package does not. So, keeping things simple to begin with, my code only handles trees with unlabeled internal nodes. (We assign those nodes labels of our own internally.) It does allow trees that are not strictly bifurcating, like the sample above, whose root has three children.
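To make the format concrete, here is a minimal sketch of a recursive-descent parser for the sample above. It is not the code from my project; the function name parse_newick, the dictionary layout, and the generated labels N1, N2, ... for internal nodes are all assumptions made for illustration. It handles only unlabeled internal nodes and allows any number of children per node.

```python
import itertools


def parse_newick(text):
    """Parse a Newick string with unlabeled internal nodes into nested dicts."""
    text = text.strip().rstrip(";")
    counter = itertools.count(1)  # source of made-up internal labels
    pos = 0

    def peek():
        # Return the current character, or "" at the end of the string.
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        node = {"label": None, "length": 0.0, "children": []}
        if peek() == "(":
            pos += 1  # skip "("
            node["children"].append(parse_node())
            while peek() == ",":
                pos += 1  # skip ","
                node["children"].append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
            # Internal nodes are unlabeled in the input, so assign our own.
            node["label"] = "N%d" % next(counter)
        else:
            start = pos
            while peek() not in ("", ":", ",", "(", ")"):
                pos += 1
            node["label"] = text[start:pos]
        if peek() == ":":
            pos += 1  # skip ":"
            start = pos
            while peek() not in ("", ",", "(", ")"):
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);")
print([child["label"] for child in tree["children"]])
```

Labels are handed out as each internal node is closed, so the inner (C,D) group becomes N1 and the root becomes N2.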

As an example, I'll use some rRNA sequences for species in the genus Campylobacter. The tree as written to a text file looks like this:


Here is the output from R for a plot, where I have substituted standard species names for the GenBank IDs:

I'll try to show in future posts how my code works. (You might download it and try it on this sample tree.) In outline, we:
• capture the titles and substitute simple labels
• parse the structure of individual nodes
• walk the tree and measure the distance from each node to the next node and to the root
• save the information to a text file with fields:
   - label
   - distance to root
   - distance from bottom of plot
   - distance to next node
   - y coordinate of first subnode (for internal nodes)
   - y coordinate of last subnode (ditto)
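The walking and saving steps in the outline might be sketched roughly as follows. This is an illustration under stated assumptions, not my actual code: the tree is hand-built as (label, branch length, children) tuples for the Newick sample earlier in the post, "distance to next node" is taken to mean the branch length to the parent, leaves are spaced one unit apart from the bottom of the plot, and an internal node sits midway between its first and last subnode. The names layout and to_text are made up.

```python
# Hand-built tree for (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
# Each node is (label, branch length to parent, children).
TREE = ("N2", 0.0, [
    ("A", 0.1, []),
    ("B", 0.2, []),
    ("N1", 0.5, [("C", 0.3, []), ("D", 0.4, [])]),
])


def layout(tree):
    """Walk the tree and return one row of plotting data per node."""
    rows = []
    leaf_count = 0

    def visit(node, parent_to_root):
        nonlocal leaf_count
        label, length, children = node
        to_root = parent_to_root + length
        if not children:
            # Assumption: leaves are stacked one unit apart from the bottom.
            y = float(leaf_count)
            leaf_count += 1
            first_y = last_y = None
        else:
            child_ys = [visit(child, to_root) for child in children]
            first_y, last_y = child_ys[0], child_ys[-1]
            # Assumption: an internal node is centered on its outer subnodes.
            y = (first_y + last_y) / 2.0
        rows.append({"label": label, "to_root": to_root, "y": y,
                     "length": length, "first_y": first_y, "last_y": last_y})
        return y

    visit(tree, 0.0)
    return rows


def to_text(rows):
    """Format the rows as tab-separated lines, one node per line."""
    fields = ("label", "to_root", "y", "length", "first_y", "last_y")
    lines = []
    for row in rows:
        # Leaves have no subnodes, so those two fields are left empty.
        lines.append("\t".join("" if row[f] is None else str(row[f])
                               for f in fields))
    return "\n".join(lines)


print(to_text(layout(TREE)))
```

Because the rows are appended after the children are visited, the subnodes of a node always appear before the node itself in the output.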

Above is the image produced by my Cocoa application that does the plotting (though one could certainly write R code to do it using the information we saved). This basic set of programs works. One thing I need to do now is much more extensive testing (boring). Also, it would be nice to have some shiny GUI elements. Imagine a joystick to move labels around, or being able to select subtrees for manipulation (like coloring).

The tree is the same as the first one. You would be able to see it more easily if we were using the species names, but I haven't done that part yet. Some nodes have been rotated, but that's legal. The only other difference is a new OTU (operational taxonomic unit) or species (labeled L) that I added to my database since I made the first figure a few weeks ago.