The fundamental problem with MSA is that it scales very poorly. Analysis of the complexity is itself complex (see the discussion in Edgar 2004 PMID 15318951), but Edgar gives it as O(N4 + L2) overall (N sequences of typical length L). Though of similar complexity, MUSCLE is significantly faster. Still, it grinds quite slowly when you have a lot of sequences (say, > 500).
From the first MUSCLE paper:
In agreement with other studies, [e.g. Katoh et al. (8)], we found that T-Coffee was unable to align more than approximately 102 sequences of typical length on a current desktop computer. CLUSTALW was able to align a few hundred sequences, with a practical limit around N = 103 where CPU time begins to scale approximately as N4. The largest set had 5000 sequences of average length 350. MUSCLE-p completed this test in 7 min, compared with 10 min for FFTNS1; we estimate that CLUSTALW would need approximately 1 year.
NAST (Nearest Alignment Space Termination) is one solution to this problem. The basic idea is simple: use an existing multiple sequence alignment as a framework against which to align new sequences. The trick is that addition of a new sequence will not change the length of the MSA. Besides the algorithm, which we'll talk about, a significant part of the effort going into NAST has been the construction of a large, curated database of high quality bacterial and archaeal 16S rRNA sequences.
The implementation has these steps:
• identify the best matching sequence to the candidate in the MSA database
• trim the candidate and do a global alignment to this best match
• resolve positions where the alignment introduces a gap into the template
The example from DeSantis et al is below (T = template; C = candidate). I've shown the template spaces as underlines to distinguish them from alignment gaps:
Collapse the template spacing:
Construct the pairwise alignment
Re-introduce template spacing
If you look at the template in the pairwise alignment you will see four positions containing introduced gap characters. Two of these can be accomodated by gaps in the original template spacing, and two cannot (marked with *). These will have to go.
The * positions must be collapsed to maintain the constant length MSA. In the case of the first one, the aligned GT to its left is displaced. In the second one, the gap between T_A in the candidate is consumed. The final result is:
Note that in both cases, a mismatch was accepted in order to repair the gap. This is the weakness of the NAST approach.
The template may not look exactly the same as when it started, but this difference is not propagated to the database.
We obtained two files from Greengenes following the Qiime tutorial (my post here). The first is a "core_set":
It contains nearly five thousand high quality 16S ribosomal RNA gene sequences that have been aligned. Each one of those sequences consists mostly of '-' gap characters. Almost all of those positions are gaps in every single sequence. These are indexed using the "lane mask":
Naturally, the "core_set" is a large file
I made some fake data to run through pynast:
Sequence B is the same as A, but missing the first nt, C is missing the first 6 nt, and D has a substitution for the first 6 nt.
The template looks like this:
There's a warning about not having code for parallel processing. And here's the alignment:
One of the things that's been overlooked in the present exploration is the crucial contribution of another Robert Edgar program,
uclust. That's used both for MSA and at least one other step in the QIIME pipeline. I need to read the paper and we'll look at it another day.