This is the last post on this project. The previous posts should be listed in the archive in the sidebar.
I have a phylogenetics project for which I constructed a boutique database. The definition is in the file db.groups.txt, one section of which looks like this:
aeromonas
X60415.1 Aeromonas_trota_ATCC49657
X74677.1 Aeromonas_hydrophila_ATCC7966T
EU770300.1 Aeromonas_enteropelogenes_MS12
Each section is set off from surrounding entries by double newlines. The "group" aeromonas has three sequences. There are also super-groups that correspond to bacterial phyla or sub-phyla, e.g., the gamma-Proteobacteria. These are defined in the file big.groups.txt like this:
gamma
aeromonas
cardio
pseudo
moraxella
entero
haemo1
haemo2
steno
xantho
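To make the two file formats concrete, here is a minimal sketch of how they could be parsed. It is not the code from fetchSeqs.py; the function names parse_groups and sequences_for are made up for illustration, and the sketch assumes that the first line of each section is its name and that each sequence line is an accession followed by a short name.

```python
def parse_groups(text):
    """Parse blank-line-separated sections into {section_name: [other lines]}."""
    groups = {}
    for section in text.strip().split("\n\n"):
        lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
        if not lines:
            continue
        # Assumption: the first line of a section is its name.
        groups[lines[0]] = lines[1:]
    return groups


def sequences_for(super_name, big_groups, db_groups):
    """Return [(accession, short_name), ...] for all groups in a super-group."""
    result = []
    for group in big_groups[super_name]:
        # Groups missing from the database text are skipped silently.
        for entry in db_groups.get(group, []):
            accession, short_name = entry.split(None, 1)
            result.append((accession, short_name))
    return result


# Sample text copied from the sections shown above (truncated).
db_text = """aeromonas
X60415.1 Aeromonas_trota_ATCC49657
X74677.1 Aeromonas_hydrophila_ATCC7966T
EU770300.1 Aeromonas_enteropelogenes_MS12
"""
big_text = """gamma
aeromonas
cardio
"""

db_groups = parse_groups(db_text)
big_groups = parse_groups(big_text)
gamma_seqs = sequences_for("gamma", big_groups, db_groups)
for accession, short_name in gamma_seqs:
    print(accession, short_name)
```

In this sample the cardio group is not present in the database text, so only the three aeromonas entries come back.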
Normally the sequences would be fetched from GenBank by a slightly complicated script, but I'd like you to be able to follow along if you want, so we'll do it almost manually. The file fetchSeqs.py contains a function that parses these two files and grabs the info we need for the "gamma" super-group. It also does a manual fetch from GenBank. We save the data to disk like so:
The beginning of one of the entries looks like this:
As you can see, the title line of this FASTA-formatted text is really long. So in the second script, analyzeSeqs.py, we use the same info from before to replace the long title with something shorter. This script also uses MUSCLE to align the sequences. You need to have it installed for this to work, or you could use Clustal instead. I used the ape library in R to make a neighbor-joining tree and write it to disk as 'tree.txt'. I show a plot of what R gave me (actually, the real process for this one was more complicated, resulting in the pretty colors). But at least you can see what the tree is supposed to look like.
The last module is test_plotter.py. It shows how to make a couple of different kinds of plots with our plotter. All of them have colored node labels as defined in the script. The first plot is the standard one; the graphic is at the top of the post. In the next, we plot with the internal node labels showing, so that we can identify the name of a node to re-root the tree, if wanted. In the third example, I got a balanced tree from PyCogent and plotted that.
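Going back to the title-shortening step in analyzeSeqs.py for a moment, here is a rough sketch of the idea. It is not the script's actual code: the function name shorten_titles is made up, and it assumes that the accession appears somewhere in each long title line. The title text and sequence in the sample are placeholders, not real GenBank data.

```python
def shorten_titles(fasta_text, names):
    """Replace each FASTA title line with the short name for its accession.

    names maps accession -> short name. Title lines that contain no
    known accession are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            for accession, short_name in names.items():
                if accession in line:
                    line = ">" + short_name
                    break
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder record: the title text and the sequence are made up.
sample = ">X60415.1 some very long descriptive title\nACGTACGT\n"
names = {"X60415.1": "Aeromonas_trota_ATCC49657"}
shortened = shorten_titles(sample, names)
print(shortened, end="")
```

Short titles like these are also what end up as the tip labels on the tree, which is the main reason to bother.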
You might notice that the font used for the external node labels is now italic, as it should be. It doesn't look that hot; I think that matplotlib is not using the OS X system fonts. Not sure why that is, yet.
I put a zip of all the files up on Dropbox (here).