Wednesday, March 9, 2011

Dental project (3)



wikimedia

Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.

And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.

Make a directory temp with a copy of seqs.fna. The sequences first need to be converted to NAST format, then we can run ChimeraSlayer.pl.


prog1=~/Software/microbiomeutil_2010-11-02/NAST-iEr/run_NAST-iEr.pl
$prog1 --query_FASTA seqs.fna > seqs.nast

prog2=~/Software/microbiomeutil_2010-11-02/ChimeraSlayer/ChimeraSlayer.pl
$prog2 --query_NAST seqs.nast


It takes the better part of an hour on my slowest machine (a 5 year old iMac).

seqs.nast.CPS.CPC.wTaxons has flagged four sequences:


DA228 INTRA-GENUS
DQ822 INTRA-PHYLUM
DV55 INTRA-FAMILY
DAA89 INTRA-FAMILY


I grab those four by hand into a new file suspects.fna (there is probably a better way) and do:


$prog1 --query_FASTA suspects.fna > suspects.nast
$prog2 --query_NAST suspects.nast --printCSalignments option


The output shows there is definitely a problem. In suspects.nast.CPS.CPC.wTaxons we have:


ChimeraSlayer DQ_822 S000427388 S000260335 1.0566 98.74 100 0.7978 74.56 0 YES NAST:1861-1863 ECO:324-325 Streptococcus Streptococcus cristatus (T); NCTC12479; AB008313 Streptococcus cristatus Lachnospiraceae Incertae Sedis Clostridium aerotolerans (T); DSM 5434; X76163 Clostridium aerotolerans INTRA-PHYLUM
Per_id parents: 73.80

Per_id(Q,A): 93.45
--------------------------------------------------- A: S000427388
99.64 78.63
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
DivR: 1.057 BS: 100.00 |
Per_id(QLA,QRB): 98.74 |
|
(L-AB: 72.50) | (R-AB: 76.92)
WinL:0-279 | WinR:280-396
|
Per_id(QLB,QRA): 74.56 |
DivR: 0.798 BS: 0.00 |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
72.86 96.58
---------------------------------------------------- B: S000260335
Per_id(Q,B): 79.85

DeltaL: 26.79 DeltaR: -17.95

!
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATGGTCAATGGGAAATCC A: S000427388
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATAGTCAATGGGAAATCC Q: DQ_822
TCCGACTAAGATTCGGAATCGGGCACAAGATCTCGACAGGCAGCACAGTGGAACTCCGGT B: S000260335
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!


ACCCTTGTCGTGACATC A: S000427388
ACCCTTGTCGTGACATC Q: DQ_822
GGTGCCTGGACCCAGCT B: S000260335
!!!!!!!!!!!!!!!!!

** Breakpoint **

!!!!!!!!!!!!! !!!!! !!!!!!!
GACCGACGGTCGAGTGTATCGGGGTAAA A: S000427388
AGAGAGGACCTCGGTTATATGACAGCGG Q: DQ_822
AGAGAGGACCTCGAGTATATCACAGCAG B: S000260335
!! ! !


The first match is great for a while, then terrible, and the second is the converse.

I need to look into whether I should update the Genbank records, but I guess probably the answer is yes.

Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:


>DQ822
BLAST front(len = 207):
320 207/207 100.00 Streptococcus_clone_BP2-57_AB121930.1
319 205/207 99.03 Streptococcus_clone_502H08_AM420202.1
323 203/207 98.07 Streptococcus_cristatus_AB008313.1
321 200/207 96.62 Streptococcus_clone_BW009_AY005042.1
334 199/209 95.22 Streptococcus_sanguinis_SK36_SK36
BLAST back(len = 187):
349 179/187 95.72 Uncultured_clone_4.59_DQ346409.1
79 179/187 95.72 Clostridiales_clone_301C11_AM420062.1
121 178/187 95.19 Eubacterium_clone_DO008_AF385508.1
117 178/187 95.19 Eubacterium_clone_BP2-88_AB121960.1
115 178/187 95.19 Eubacterium_clone_BL026B96_AY806377.1

>DV55
BLAST front(len = 210):
353 207/210 98.57 Uncultured_clone_E105_DQ326659.1
100 206/210 98.10 Dialister_sp._E2_20_AF481209.1
94 205/210 97.62 Dialister_invisus_AY162469.1
31 190/207 91.79 Allisonella_clone_BL34_DQ130020.1
93 192/210 91.43 Dialister_clone_MCE7_134_AF481210.1
BLAST back(len = 190):
373 190/190 100.00 Veillonella_parvula_X84005.1
372 190/190 100.00 Veillonella_clone_X042_AF287781.1
370 190/190 100.00 Veillonella_clone_BU083_AF366266.1
369 190/190 100.00 Veillonella_clone_AA050_AF287782.1
371 182/183 99.45 Veillonella_clone_R1_DQ123569.1

>DAA89
BLAST front(len = 210):
283 209/210 99.52 Selenomonas_clone_CI002_AF287798.1
286 202/210 96.19 Selenomonas_clone_EQ054_AF385495.1
288 201/210 95.71 Selenomonas_clone_FT050_AY349403.1
298 189/195 96.92 Selenomonas_noxia_AF287799.1
297 200/210 95.24 Selenomonas_infelix_AF287802.1
BLAST back(len = 190):
373 184/187 98.40 Veillonella_parvula_X84005.1
370 184/187 98.40 Veillonella_clone_BU083_AF366266.1
372 181/184 98.37 Veillonella_clone_X042_AF287781.1
369 181/184 98.37 Veillonella_clone_AA050_AF287782.1
371 182/187 97.33 Veillonella_clone_R1_DQ123569.1


Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.

It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.

Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.

We'll remove these from our sequence file by hand. Now there are 1120.


DA_228
DQ_822
DV_55
DAA_89