Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.
And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.
Make a directory temp with a copy of seqs.fna. The sequences first need to be converted to NAST format, then we can run ChimeraSlayer.pl. It takes the better part of an hour on my slowest machine (a 5-year-old iMac).
seqs.nast.CPS.CPC.wTaxons has flagged four sequences:
I grab those four by hand into a new file suspects.fna (there is probably a better way) and do:
The output shows there is definitely a problem. In suspects.nast.CPS.CPC.wTaxons we have:
The first match is great for a while, then terrible, and the second is the converse.
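As for the "better way" of pulling the suspects into their own file, here is a minimal sketch in plain Python. It assumes each title line starts with the sequence ID followed by whitespace. The function names and the toy IDs (seq_1 and so on) are made up for illustration; they are not the real sequence names.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def select_records(lines, wanted_ids):
    """Keep records whose ID (first word of the title) is in wanted_ids."""
    wanted = set(wanted_ids)
    return [(t, s) for t, s in read_fasta(lines)
            if (t.split() or [""])[0] in wanted]


def format_fasta(records, width=60):
    """Render (title, sequence) pairs as FASTA text, wrapped at width."""
    out = []
    for title, seq in records:
        out.append(">" + title)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy data standing in for seqs.fna; these IDs are hypothetical.
demo = """>seq_1 sample A
ACGTACGT
ACGT
>seq_2 sample B
GGGGCCCC
>seq_3 sample C
TTTTAAAA
""".splitlines()

suspects = select_records(demo, ["seq_1", "seq_3"])
print(format_fasta(suspects), end="")
```

With the real data, one would pass an open file handle for seqs.fna instead of demo and write the result of format_fasta to suspects.fna.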
I need to look into whether I should update the Genbank records; the answer is probably yes.
Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:
Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.
It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.
Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.
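The missing piece, code that reads the hits and raises a flag, might look something like the sketch below. This is not my actual tool. It assumes the BLAST step has already produced a ranked list of hit names for the front and back half of each sequence, and it uses a deliberately crude rule (my assumption): flag a sequence when its top five front hits and top five back hits have nothing in common. The names split_halves, flag_suspects, and the toy hit names are all hypothetical.

```python
def split_halves(seq):
    """Split a sequence into front and back halves for separate searches."""
    mid = len(seq) // 2
    return seq[:mid], seq[mid:]


def flag_suspects(hits_by_seq, top_n=5):
    """Return IDs whose front-half and back-half top hits do not overlap.

    hits_by_seq maps a sequence ID to (front_hits, back_hits), where each
    is a list of database hit names ordered best first.
    """
    flagged = []
    for seq_id, (front, back) in sorted(hits_by_seq.items()):
        # Skip sequences with no hits for one half; nothing to compare.
        if not front or not back:
            continue
        if not set(front[:top_n]) & set(back[:top_n]):
            flagged.append(seq_id)
    return flagged


# Made-up hit lists, for illustration only.
hits = {
    "query_1": (["ref_A", "ref_B"], ["ref_B", "ref_A"]),  # halves agree
    "query_2": (["ref_A", "ref_B"], ["ref_C", "ref_D"]),  # halves disagree
}

for seq_id in flag_suspects(hits):
    print("possible chimera:", seq_id)
```

A rule this simple will produce some false alarms and miss chimeras between close relatives, but the point is that it runs every time without anyone having to remember to read the output.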
We'll remove these from our sequence file by hand. Now there are 1120.