Python for Bioinformatics: Dental project (3)

wikimedia

Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.

And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.

Make a directory temp with a copy of seqs.fna. The sequences first need to be converted to NAST format, then we can run ChimeraSlayer.pl.

prog1=~/Software/microbiomeutil_2010-11-02/NAST-iEr/run_NAST-iEr.pl
$prog1 --query_FASTA seqs.fna > seqs.nast

prog2=~/Software/microbiomeutil_2010-11-02/ChimeraSlayer/ChimeraSlayer.pl
$prog2 --query_NAST seqs.nast

It takes the better part of an hour on my slowest machine (a 5 year old iMac).

seqs.nast.CPS.CPC.wTaxons has flagged four sequences:

DA228 INTRA-GENUS
DQ822 INTRA-PHYLUM
DV55 INTRA-FAMILY
DAA89 INTRA-FAMILY

I grab those four by hand into a new file suspects.fna (there is probably a better way) and do:

$prog1 --query_FASTA suspects.fna > suspects.nast
$prog2 --query_NAST suspects.nast --printCSalignments option

The output shows there is definitely a problem. In suspects.nast.CPS.CPC.wTaxons we have:

ChimeraSlayer DQ_822 S000427388 S000260335 1.0566 98.74 100 0.7978 74.56 0 YES NAST:1861-1863 ECO:324-325 Streptococcus Streptococcus cristatus (T); NCTC12479; AB008313 Streptococcus cristatus Lachnospiraceae Incertae Sedis Clostridium aerotolerans (T); DSM 5434; X76163 Clostridium aerotolerans INTRA-PHYLUM
            Per_id parents: 73.80

           Per_id(Q,A): 93.45
--------------------------------------------------- A: S000427388
 99.64                                78.63
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
DivR: 1.057 BS: 100.00     |
Per_id(QLA,QRB): 98.74   |
                         |
   (L-AB: 72.50)         |      (R-AB: 76.92)
   WinL:0-279            |      WinR:280-396
                         |
Per_id(QLB,QRA): 74.56   |
DivR: 0.798 BS: 0.00    |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
 72.86                                96.58
---------------------------------------------------- B: S000260335
            Per_id(Q,B): 79.85

DeltaL: 26.79                   DeltaR: -17.95

                                            !               
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATGGTCAATGGGAAATCC  A: S000427388
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATAGTCAATGGGAAATCC  Q: DQ_822
TCCGACTAAGATTCGGAATCGGGCACAAGATCTCGACAGGCAGCACAGTGGAACTCCGGT  B: S000260335
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!

                 
ACCCTTGTCGTGACATC  A: S000427388
ACCCTTGTCGTGACATC  Q: DQ_822
GGTGCCTGGACCCAGCT  B: S000260335
!!!!!!!!!!!!!!!!!

  ** Breakpoint **

!!!!!!!!!!!!!  !!!!! !!!!!!!
GACCGACGGTCGAGTGTATCGGGGTAAA  A: S000427388
AGAGAGGACCTCGGTTATATGACAGCGG  Q: DQ_822
AGAGAGGACCTCGAGTATATCACAGCAG  B: S000260335
             !!     !     !

The first match is great for a while, then terrible, and the second is the converse.

I need to look into whether I should update the Genbank records, but I guess probably the answer is yes.

Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:

>DQ822
BLAST front(len = 207):
320        207/207  100.00  Streptococcus_clone_BP2-57_AB121930.1
319        205/207   99.03  Streptococcus_clone_502H08_AM420202.1
323        203/207   98.07  Streptococcus_cristatus_AB008313.1
321        200/207   96.62  Streptococcus_clone_BW009_AY005042.1
334        199/209   95.22  Streptococcus_sanguinis_SK36_SK36
BLAST back(len = 187):
349        179/187   95.72  Uncultured_clone_4.59_DQ346409.1
79         179/187   95.72  Clostridiales_clone_301C11_AM420062.1
121        178/187   95.19  Eubacterium_clone_DO008_AF385508.1
117        178/187   95.19  Eubacterium_clone_BP2-88_AB121960.1
115        178/187   95.19  Eubacterium_clone_BL026B96_AY806377.1

>DV55
BLAST front(len = 210):
353        207/210   98.57  Uncultured_clone_E105_DQ326659.1
100        206/210   98.10  Dialister_sp._E2_20_AF481209.1
94         205/210   97.62  Dialister_invisus_AY162469.1  
31         190/207   91.79  Allisonella_clone_BL34_DQ130020.1
93         192/210   91.43  Dialister_clone_MCE7_134_AF481210.1
BLAST back(len = 190):
373        190/190  100.00  Veillonella_parvula_X84005.1  
372        190/190  100.00  Veillonella_clone_X042_AF287781.1
370        190/190  100.00  Veillonella_clone_BU083_AF366266.1
369        190/190  100.00  Veillonella_clone_AA050_AF287782.1
371        182/183   99.45  Veillonella_clone_R1_DQ123569.1

>DAA89
BLAST front(len = 210):
283        209/210   99.52  Selenomonas_clone_CI002_AF287798.1
286        202/210   96.19  Selenomonas_clone_EQ054_AF385495.1
288        201/210   95.71  Selenomonas_clone_FT050_AY349403.1
298        189/195   96.92  Selenomonas_noxia_AF287799.1  
297        200/210   95.24  Selenomonas_infelix_AF287802.1
BLAST back(len = 190):
373        184/187   98.40  Veillonella_parvula_X84005.1  
370        184/187   98.40  Veillonella_clone_BU083_AF366266.1
372        181/184   98.37  Veillonella_clone_X042_AF287781.1
369        181/184   98.37  Veillonella_clone_AA050_AF287782.1
371        182/187   97.33  Veillonella_clone_R1_DQ123569.1

Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.

It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.

Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.

We'll remove these from our sequence file by hand. Now there are 1120.

DA_228
DQ_822
DV_55
DAA_89

Wednesday, March 9, 2011

Dental project (3)