Monday, February 21, 2011

Qiime (4) PyNAST

Today I've been on a side-track from the QIIME pipeline (or toolkit), related to the construction of multiple sequence alignments (MSA). The favored approach in QIIME is that of NAST (DeSantis et al. 2006 PMID 16845035) as reimplemented in PyNAST using Python (Caporaso et al. 2010 PMID 19914921).

The fundamental problem with MSA is that it scales very poorly. Analysis of the complexity is itself complex (see the discussion in Edgar 2004 PMID 15318951), but Edgar gives it as O(N4 + L2) overall (N sequences of typical length L). Though of similar complexity, MUSCLE is significantly faster. Still, it grinds quite slowly when you have a lot of sequences (say, > 500).

From the first MUSCLE paper:


In agreement with other studies, [e.g. Katoh et al. (8)], we found that T-Coffee was unable to align more than approximately 102 sequences of typical length on a current desktop computer. CLUSTALW was able to align a few hundred sequences, with a practical limit around N = 103 where CPU time begins to scale approximately as N4. The largest set had 5000 sequences of average length 350. MUSCLE-p completed this test in 7 min, compared with 10 min for FFTNS1; we estimate that CLUSTALW would need approximately 1 year.


NAST (Nearest Alignment Space Termination) is one solution to this problem. The basic idea is simple: use an existing multiple sequence alignment as a framework against which to align new sequences. The trick is that addition of a new sequence will not change the length of the MSA. Besides the algorithm, which we'll talk about, a significant part of the effort going into NAST has been the construction of a large, curated database of high quality bacterial and archaeal 16S rRNA sequences.

The implementation has these steps:

• identify the best matching sequence to the candidate in the MSA database
• trim the candidate and do a global alignment to this best match
• resolve positions where the alignment introduces a gap into the template

The example from DeSantis et al is below (T = template; C = candidate). I've shown the template spaces as underlines to distinguish them from alignment gaps:


T:  ATAC_____GTA_AC____GTA___C___G_T_AC_GG


Collapse the template spacing:


T:  ATACGTAACGTACGTACGG
C: CACGTTAAACGTCGTACCCGG


Construct the pairwise alignment


T:  ATACGT-A-ACGTACGTAC--GG
C: C-ACGTTAAACGT-CGTACCCGG


Re-introduce template spacing


               *                         *
T: ATAC_____GT-A-AC____GTA___C___G_T_AC--GG
C: C-AC_____GTTAAAC____GT-___C___G_T_ACCCGG


If you look at the template in the pairwise alignment you will see four positions containing introduced gap characters. Two of these can be accomodated by gaps in the original template spacing, and two cannot (marked with *). These will have to go.

The * positions must be collapsed to maintain the constant length MSA. In the case of the first one, the aligned GT to its left is displaced. In the second one, the gap between T_A in the candidate is consumed. The final result is:


T:  ATAC_____GT-AAC____GTA___C___G_T_AC_GG
C: C-AC____GTTAAAC____GT-___C___G_TACCCGG


Note that in both cases, a mismatch was accepted in order to repair the gap. This is the weakness of the NAST approach.

The template may not look exactly the same as when it started, but this difference is not propagated to the database.

We obtained two files from Greengenes following the Qiime tutorial (my post here). The first is a "core_set":


>>> FH = open('core_set_aligned.fasta.imputed.txt')
>>> data = FH.read().strip()
>>> FH.close()
>>> data.count('>')
4938
>>> first = data.split('>')[1].strip().split('\n',1)[1]
>>> len(first)
7682


It contains nearly five thousand high quality 16S ribosomal RNA gene sequences that have been aligned. Each one of those sequences consists mostly of '-' gap characters. Almost all of those positions are gaps in every single sequence. These are indexed using the "lane mask":


>>> FH = open('lanemask_in_1s_and_0s.txt')
>>> mask = FH.read().strip()
>>> FH.close()
>>> len(mask)
7682
>>> mask.count('0')
6395
>>> mask.count('1')
1287
>>> L = [c for c,m in zip(first,mask) if m == '1']
>>> print ''.join(L)
>>> print ''.join(L)[:50]
ATTGAACGCTGGCGGTATGCTTAACACATGCAAGTCGAACG-AGTGGCGG
>>> len(L)
1287


Naturally, the "core_set" is a large file


>>> data.count('>')*len(mask)
37933716

> ls -l core_set_aligned.fasta.imputed.txt
-rw-r--r--@ 1 telliott staff 37975992 Feb 19 06:07 core_set_aligned.fasta.imputed.txt


I made some fake data to run through pynast:


>0 A
GAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGTATGCTT
>1 B
AGTTTGATCCTGGCTCAGATTGAACGCTGGCGGTATGCTT
>2 C
GAGGATCCTGGCTCAGATTGAACGCTGGCGGTATGCTT
>3 D
CCCGAGGATCCTGGCTCAGATTGAACGCTGGCGGTATGCTT


Sequence B is the same as A, but missing the first nt, C is missing the first 6 nt, and D has a substitution for the first 6 nt.

The template looks like this:


>Y
GAGTTT-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T
>Z
gagttt-ga--t-cc-t-g-gctc-ag-at-tgaa-c-gc--tgg-c--g-gc-a-gg--c----c-t



> pynast -i data.txt -t template.txt -l 25
/Library/Python/2.6/site-packages/cogent/evolve/likelihood_tree.py:6: UserWarning: Not using MPI as mpi4py not found
from cogent.util.parallel import MPI


There's a warning about not having code for parallel processing. And here's the alignment:


>0 A 1..41
GAGTTT-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T
>1 B 1..40
-AGTTT-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T
>2 C 1..38
---GAG-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T
>3 D 1..40
-CCGAG-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T


One of the things that's been overlooked in the present exploration is the crucial contribution of another Robert Edgar program, uclust. That's used both for MSA and at least one other step in the QIIME pipeline. I need to read the paper and we'll look at it another day.