Tuesday, March 8, 2011

Human Oral Microbiome Database

Bruce Paster and colleagues at the Forsyth Institute pioneered the molecular classification of oral microbes. There is a curated database called HOMD (Human Oral Microbiome Database) and also a BLAST server. I thought we could take a quick look at the database, and exercise it by comparing the taxonomic assignments to those we get using the RDP classifier accessed through QIIME.

The standard HOMD 16S rRNA RefSeq database (version 10.1, 2010-02-08) can be downloaded from this page. The filename is oral16S_20100208_9up.txt. It is formatted for Windows. (Always something to look out for). Just do:

data = data.replace('\r\n','\n')

or split on '>' and then strip(), ignoring the first (empty) value. It contains 755 sequences.

The phylogenetic assignments are in files linked here. We'll just use the first one, called homd_taxonomy_table.txt.

The taxonomy table has 625 entries in 27 columns. The zeroth column is labeled HOT_ID (001 through 850)---which look like integers and appear to be unique. The next 6 columns (L[1:7]) contain the phylognetic classification. Then comes a column (L[7]) labeled Species which sometimes looks like "sp. oral taxon 123", and sometimes looks like "invisus". Finally, four columns over is one labeled 16S_rRNA, which may have one or more Genbank IDs (separated by '|'), or perhaps something like "To be submitted".

The sequences have title lines like:

>001A28SC| Bartonella sp. | Oral Taxon 001 | Strain A28SC | GQ422708 | 0 | U

where the 001 is appears to be the oral taxon number, followed by the strain or clone number, although sometimes it is part of the Genbank ID, and sometimes it's the first few letters of the genus and species. Sometimes, an underscore follows the oral taxon number, but usually not.

>002_3433| Caulobacter sp. | Oral Taxon 002 | Strain TWE165 | DQ493433 | 2 | U

It appears that the rule was to make the title (the part before the first '|') contain 8 characters, and if the strain / clone identifier naturally contains 5, they dropped the underscore.

It seems simple enough to take the first 3 characters of the sequence's title line to map the sequences to the taxonomic information, and see how that works out. A possible difficulty is that a substantial fraction of these are not unique. There are four sequences labeled 058:

>058_8632| Streptococcus sp. | Oral Taxon 058 | Clone C3MLM097 | AY278632 | 251 | U
>058BM035| Streptococcus sp. | Oral Taxon 058 | Clone BM035 | AY005043 | 251 | U
>058BW009| Streptococcus sp. | Oral Taxon 058 | Clone BW009 | AY005042 | 251 | U
>058CH016| Streptococcus sp. | Oral Taxon 058 | Clone CH016 | AY005044 | 251 | U

The corresponding entry of the taxonomy table is

058 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus sp. oral taxon 058    AY005042|AY005043|AY005044|AY278632 251 0.72 26 Oral Clone BM035|Oral Clone BW009|Oral Clone C3MLM097|Oral Clone CH016  0 251 0 SEQF1704

Since this is just a demo, I'm going to remove duplicates from the sequences. That should leave us with 618. In the process we'll collapse the sequence title down to just those three characters, and rename the file 'homd.txt'.

The next step is to parse the phylogenetic information. That's also elementary so I won't show it either. We grab columns 0, 2 through 7, and 11.

The next thing to do is run the RDP classifier (using QIIME's script, see my post here).

> assign_taxonomy.py -i homd.txt -m rdp -o tax

Those results look like this:

001 Root;Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Bartonellaceae;Bartonella 1.000

where the taxonomy is given in 7 levels starting with "Root". Some designations are quoted: "Lactobacillales", for example. The last value is a measure of the confidence of the assignment. I will strip the first 2 levels from the RDP taxonomy (just as a I stripped "Bacteria" from the HOMD taxonomy).

Now we just load the two data sets. Look at the first few entries:

homd: Proteobacteria:Alphaproteobacteria:Rhizobiales:Bartonellaceae:Bartonella
rdp: Proteobacteria:Alphaproteobacteria:Rhizobiales:Bartonellaceae:Bartonella

homd: Proteobacteria:Alphaproteobacteria:Caulobacterales:Caulobacteraceae:Caulobacter
rdp: Proteobacteria:Alphaproteobacteria:Caulobacterales:Caulobacteraceae

homd: Proteobacteria:Alphaproteobacteria:Sphingomonadales:Sphingomonadaceae:Sphingomonas
rdp: Proteobacteria:Alphaproteobacteria:Sphingomonadales:Sphingomonadaceae:Sphingomonas

Now we look for differences. There seems to be an issue with the TM7, so we'll skip those (there are only about 10 anyway). Of 618 examined, there were 31 with discrepancies (more than I expected). I thought perhaps these would be ones that RDP was uncertain about, but that doesn't seems to be the case generally. Here are a few:

homd: Bacteroidetes:Flavobacteria:Flavobacteriales:Bacteroidetes[F-7]:Bacteroidetes[G-7]
rdp: Bacteroidetes:Sphingobacteria:Sphingobacteriales:Sphingobacteriaceae:Pedobacter

homd: Firmicutes:Mollicutes:Acholeplasmatales:Acholeplasmatales[F-1]:Acholeplasmatales[G-1]
rdp: o

homd: Bacteroidetes:Bacteroides:Bacteroidales:Bacteroidetes[F-5]:Bacteroidetes[G-5]
rdp: Bacteroidetes:o

homd: Firmicutes:Mollicutes:Mycoplasmatales:Mycoplasmataceae:Mycoplasma
rdp: Tenericutes:Mollicutes:Mycoplasmatales:Mycoplasmataceae:Mycoplasma:o

This is all such basic use of Python that I won't post the code. The next step is to use the HOMD data to explore map variation along the 16S rRNA sequence.