Python for Bioinformatics: Gene Ontology continued

Saturday, December 4, 2010

Gene Ontology continued

This is the continuation of a project using the Gene Ontology (first post here). For this part, you'll need the annotations associated with the yeast genome (here), or at least that's what I used. In the project files (link below) you'll find a short script that loads the data from this file. It expects to find the file in the db folder.
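To give a feel for what such a loading script might do, here is a minimal sketch. It is not the script from the project files. It assumes the annotation file is tab-delimited, that comment lines start with '!', and that the gene symbol and GO id sit at indexes 2 and 4; only the description at index 9 is confirmed by this post. The function name load_annotations and the sample row are made up for illustration.

```python
def load_annotations(lines):
    """Group tab-delimited annotation rows by gene symbol."""
    genes = {}
    for line in lines:
        # Assumption: header/comment lines start with '!'.
        if line.startswith('!') or not line.strip():
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 10:
            continue  # too short to hold a description field
        symbol = fields[2]       # assumed column for the gene symbol
        go_id = fields[4]        # assumed column for the GO id
        description = fields[9]  # description field, as stated in the post
        entry = genes.setdefault(
            symbol, {'description': description, 'go_ids': set()})
        entry['go_ids'].add(go_id)
    return genes

# One hand-built row; every column except symbol, GO id and
# description is filler, not real data.
row = '\t'.join(['db', 'id', 'MFA1', '', 'GO:0000750', 'ref', 'ev', '', 'x',
                 'Mating pheromone a-factor, made by a cells'])
sample = ['!a comment line\n', row + '\n']
genes = load_annotations(sample)
print(genes['MFA1'])
```

Since the function takes any iterable of lines, an open file handle for the annotation file in the db folder could be passed in directly.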

Another short script, useGO.py, just exercises things a bit. We load the GO data and the yeast annotations. Given a target list (in this case ['pheromone']), we look for all the yeast genes containing that word in the description field (at index 9 of the original yeast db file). We then recover the GO ids for those genes and print all the applicable GO terms, obtained using the recursive code from the first post.
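The search step could look something like the sketch below. This is not the code in useGO.py. It assumes the annotations have already been collected into a dict keyed by gene symbol, with a description string and a set of GO ids for each gene. The name find_genes, the dict layout, and the second entry (XYZ1) are hypothetical, and matching case-insensitively is my own choice here.

```python
def find_genes(genes, targets):
    """Return genes whose description contains any target word."""
    targets = [t.lower() for t in targets]
    hits = {}
    for symbol, entry in genes.items():
        description = entry['description'].lower()
        if any(t in description for t in targets):
            hits[symbol] = entry
    return hits

# Tiny stand-in for the loaded yeast annotations (XYZ1 is invented).
genes = {
    'MFA1': {'description': 'Mating pheromone a-factor, made by a cells',
             'go_ids': {'GO:0000750'}},
    'XYZ1': {'description': 'Invented entry with no matching word',
             'go_ids': set()},
}

hits = find_genes(genes, ['pheromone'])
for symbol, entry in hits.items():
    print(symbol)
    print(entry['description'])
    print(sorted(entry['go_ids']))
```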

The sample output shows just one of the genes found:

MFA1
Mating pheromone a-factor, made by a cells
['GO:0000750']
['pheromone-dependent signal transduction involved in conjugation with cellular fusion']
GO:0000750 pheromone-dependent signal transduction involved in conjugation with cellular fusion
GO:0007186 G-protein coupled receptor protein signaling pathway
GO:0007166 cell surface receptor linked signaling pathway
GO:0023033 signaling pathway
GO:0023052 signaling
GO:0008150 biological_process
GO:0032005 regulation of conjugation with cellular fusion by signal transduction
GO:0007165 signal transduction
GO:0050794 regulation of cellular process
GO:0050789 regulation of biological process
GO:0065007 biological regulation
GO:0031137 regulation of conjugation with cellular fusion
GO:0046999 regulation of conjugation
GO:0043900 regulation of multi-organism process
GO:0048610 cellular process involved in reproduction
GO:0009987 cellular process
GO:0022414 reproductive process

I think you can see what GO is supposed to be about. We gradually progress to more and more general categories as we work our way up the tree.
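Working up the tree can be sketched as a small recursive function. This is only an illustration, not the recursive code from the first post. It assumes a dict named parents that maps each term to a list of its parent terms, and the term names in the toy graph are invented.

```python
def ancestors(term, parents, seen=None):
    """List the ancestors of term depth-first, each one only once."""
    if seen is None:
        seen = []
    for parent in parents.get(term, []):
        if parent not in seen:
            seen.append(parent)
            ancestors(parent, parents, seen)
    return seen

# Toy graph with invented names: 'leaf' has two parents that
# both lead to the same 'root'.
parents = {
    'leaf': ['mid_a', 'mid_b'],
    'mid_a': ['root'],
    'mid_b': ['root'],
    'root': [],
}

print(ancestors('leaf', parents))  # ['mid_a', 'root', 'mid_b']
```

The first path is followed all the way to the top before the second parent is visited, and a term already reached is not listed again.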

What's not obvious from the approach I've used so far is that each chain of terms ends in one of three major categories. These are:

GO:0003674 ['molecular_function']
GO:0008150 ['biological_process']
GO:0005575 ['cellular_component']
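To find which of the three categories a term belongs to, one option is to climb until one of these ids is reached. The sketch below assumes the same kind of parents dict (term mapped to a list of parent terms). The function name root_categories and the 'toy:' ids are hypothetical; only the three root ids come from the listing above.

```python
ROOTS = {
    'GO:0003674': 'molecular_function',
    'GO:0008150': 'biological_process',
    'GO:0005575': 'cellular_component',
}

def root_categories(term, parents):
    """Return the names of the root categories reachable from term."""
    if term in ROOTS:
        return {ROOTS[term]}
    found = set()
    for parent in parents.get(term, []):
        found |= root_categories(parent, parents)
    return found

# Invented two-step chain ending at the biological_process root.
parents = {
    'toy:child': ['toy:mid'],
    'toy:mid': ['GO:0008150'],
}

print(root_categories('toy:child', parents))
```

This version revisits shared ancestors, which is fine for a small example; for the full ontology it would be worth remembering the terms already seen.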

The other thing is that the flat listing above obscures the branching: a term can have more than one parent. I have a modification to the code that gives this information, and a graph that plots it. More in a later post. Zipped project files here.