The standard HOMD 16S rRNA RefSeq database (version 10.1, 2010-02-08) can be downloaded from this page. The filename is
oral16S_20100208_9up.txt
. It is formatted for Windows. (Always something to look out for). Just do:or split on '>' and then strip(), ignoring the first (empty) value. It contains 755 sequences.
The phylogenetic assignments are in files linked here. We'll just use the first one, called
homd_taxonomy_table.txt
.The taxonomy table has 625 entries in 27 columns. The zeroth column is labeled HOT_ID (001 through 850)---which look like integers and appear to be unique. The next 6 columns (L[1:7]) contain the phylognetic classification. Then comes a column (L[7]) labeled Species which sometimes looks like "sp. oral taxon 123", and sometimes looks like "invisus". Finally, four columns over is one labeled 16S_rRNA, which may have one or more Genbank IDs (separated by '|'), or perhaps something like "To be submitted".
The sequences have title lines like:
where the 001 is appears to be the oral taxon number, followed by the strain or clone number, although sometimes it is part of the Genbank ID, and sometimes it's the first few letters of the genus and species. Sometimes, an underscore follows the oral taxon number, but usually not.
It appears that the rule was to make the title (the part before the first '|') contain 8 characters, and if the strain / clone identifier naturally contains 5, they dropped the underscore.
It seems simple enough to take the first 3 characters of the sequence's title line to map the sequences to the taxonomic information, and see how that works out. A possible difficulty is that a substantial fraction of these are not unique. There are four sequences labeled 058:
The corresponding entry of the taxonomy table is
Since this is just a demo, I'm going to remove duplicates from the sequences. That should leave us with 618. In the process we'll collapse the sequence title down to just those three characters, and rename the file 'homd.txt'.
The next step is to parse the phylogenetic information. That's also elementary so I won't show it either. We grab columns 0, 2 through 7, and 11.
The next thing to do is run the RDP classifier (using QIIME's script, see my post here).
Those results look like this:
where the taxonomy is given in 7 levels starting with "Root". Some designations are quoted: "Lactobacillales", for example. The last value is a measure of the confidence of the assignment. I will strip the first 2 levels from the RDP taxonomy (just as a I stripped "Bacteria" from the HOMD taxonomy).
Now we just load the two data sets. Look at the first few entries:
Now we look for differences. There seems to be an issue with the TM7, so we'll skip those (there are only about 10 anyway). Of 618 examined, there were 31 with discrepancies (more than I expected). I thought perhaps these would be ones that RDP was uncertain about, but that doesn't seems to be the case generally. Here are a few:
This is all such basic use of Python that I won't post the code. The next step is to use the HOMD data to explore map variation along the 16S rRNA sequence.