Python for Bioinformatics: Dental project (5)

Saturday, March 12, 2011

Dental project (5)

This post is one of a series (see dental project here or in the sidebar).

Last time I said I would show you how I make heatmaps these days. I've approached it several different ways over the past few years (R, Cocoa, matplotlib), but I think now that matplotlib is best, at least for me. Ultimately what I want is flexibility, and if you're a Python coder and you have matplotlib installed (as we've also discussed many times), then you'll have that. But I don't want to get into the technical details, and the script is a bit long anyway, so I just put it (Heatmapper.py and its helper Preprocessor.py) into the zipped project files on Dropbox (here). You just need a file data.csv in the same directory.

The output from two different modes is at the bottom of the post. It looks a little fuzzy and not as clean as I would like, partly because there are so many samples and partly because of the italic font. If you savefig to a pdf file and then blow it up, it looks great.
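For anyone who wants the general idea without opening the zip, here is a minimal sketch of a matplotlib heatmap. It is not Heatmapper.py: the function names are made up for illustration, and the layout of data.csv is an assumption (a header row of sample names, then one row per taxon with the taxon name in the first column and counts after it). The real file may well look different.

```python
import csv


def load_counts(path):
    """Read a count table from CSV.

    Assumed (hypothetical) layout: first row is a header of sample
    names, each later row is a taxon name followed by its counts.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    samples = rows[0][1:]
    taxa = [row[0] for row in rows[1:]]
    counts = [[float(x) for x in row[1:]] for row in rows[1:]]
    return taxa, samples, counts


def to_fractions(counts):
    """Scale each sample (column) so its values sum to 1.

    A column whose total is zero is left as all zeros.
    """
    totals = [sum(col) for col in zip(*counts)]
    return [[v / t if t else 0.0 for v, t in zip(row, totals)]
            for row in counts]


def plot_heatmap(taxa, samples, values, out_path="heatmap.pdf"):
    """Draw the table as a heatmap and save it (pdf scales cleanly)."""
    # Imported here so the two helpers above work without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(values, aspect="auto", cmap="hot",
                      interpolation="nearest")
    ax.set_xticks(range(len(samples)))
    ax.set_xticklabels(samples, rotation=90)
    ax.set_yticks(range(len(taxa)))
    ax.set_yticklabels(taxa, fontstyle="italic")  # taxon names in italics
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

With a data.csv in that assumed layout, the three calls chain together: load_counts("data.csv") gives the taxa, samples and counts, to_fractions(counts) normalizes each sample, and plot_heatmap(taxa, samples, fractions) writes heatmap.pdf. Normalizing per sample is my choice for the sketch, since the samples have different numbers of sequences; it is not necessarily what the real script does.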

In this post I want to talk in a general way about the project and what I think it means. It began about four years ago, when we became aware that some folks in Dentistry at our school (WVU) were involved in a huge study of people from Appalachia (it's called COHRA). Poor oral health is a particular problem in West Virginia, and this study had collected thousands of samples along with patient histories and lots of clinical data. My belief is that the important thing about these samples is that the patients are young yet have serious periodontal issues. In any event, we convinced the people who actually run the project (based elsewhere) to let us have (a small part of) 8 samples out of all their thousands sitting in the freezers down the hall.

We did PCR with "universal" primers for the bacterial 16S rRNA gene, and cloned and sequenced the numbers you see in the table. It's not a big study (we don't have much money anyway), but we saw something which I think is truly significant. In high-disease individuals, a broad group of microbes from the Clostridiales, including an unusual clade called the Veillonellaceae, is increased in abundance, whereas the sequences from control individuals in this clade were all very closely related to Veillonella parvula.

One reason this observation may be important is that the so-called "red complex", which is thought to be associated with serious periodontal disease, can only be recovered in about half the individuals with this diagnosis (not even considering abundance).

That story is in the modified version of the map above, where I drew a red box around the region of interest for the three controls, or "low disease" samples. Time went on, and another set of samples was added to the study from a different group, and we were able to get the work published. So that's why the study looks so old-fashioned: in an era of millions of reads, we've got about a thousand.

My role in all this was to actually do the analysis. I remember "we" wrote a grant (actually, someone else did!) and listed me as a technical expert in bioinformatics. Of course, the reviews were scathing. Dr. E doesn't have a degree in bioinformatics. How could he know anything?

Well, I've learned a few things over the years. Rule one is, never make your own database: let someone else do it. That's why HOMD (and Greengenes and RDP) are so great. I particularly like the tools at the RDP site. It is very nice software.

And rule two is, if you live long enough, you will see work that took you months or years to accomplish be achieved using new tools in mere seconds or hours. Sequencing is a great example of this. When I was young I spent most of three months getting 500 bp; when I was a bit older I invested six months for 3.5 kb; still later it was a year for 20 kb.

This project is another example. I spent a year and more writing some 50 or so Python scripts (and rewriting them), and now QIIME does the whole thing in mere seconds.

Well, not quite the whole thing. I have a bit more to do with this project. I want to show you the UniFrac analysis of beta diversity, and show how to make what I think is a nicer plot of the PCoA results. Also, I want to show some phylogenetic trees detailing the increased diversity (species richness, really) in the Veillonellaceae that I mentioned.

And I should say: it's been fun. Even if I don't have that degree, or any papers with Rob Knight, I think I've learned something about bioinformatics in the last 5 years.