Saturday, August 8, 2009

Alignment: Gotoh, and Smith-Waterman

In sequence alignment, we want to penalize gaps heavily so that we don't get this:

However, insertions and deletions (indels) don't happen one nucleotide at a time. Instead, it is thought that they tend to occur in the surface exposed loops of proteins, rather than in the core. Core regions are tightly packed and unlikely to tolerate much disruption. I need to look for papers about the above assertions, but let's continue. The idea of affine gap penalties is that there is a large hit for allowing a gap in the alignment, but a rather small penalty for extending one. We might have:

• gap opening penalty = -12
• gap extension penalty = -2

In a real problem, we would follow this approach. However, I want to keep it simple, so for this post, we are going to do as before, and have one penalty for each character (base or amino acid) opposite a gap (-8).

We want to align protein sequences this time: HEAGAWGHEE and PAWHEAE. We'll use one of the BLOSUM scoring matrices to do this (Henikoff & Henikoff, PMID: 1438297. Here, I have copied out the relevant pairwise scores from the BLOSUM50 matrix:

We set up the table for the alignment scores exactly as before:

We fill it in using the same rules:

Having cached the arrows which show how we got to each square, we can do the traceback starting in the lower right-hand corner and work our way back:

The alignment is then:

As before, notice that going up corresponds to a gap in sequence 1, while going left is a gap in sequence 2.

This particular application of NW is said to be due to Gotoh (PMID: 7166760). His paper in JMB is buried in the bowels of the central library about 2 miles from my desk. So I haven't actually read it. If you have a copy, I'd be grateful.

Smith-Waterman is a modification of this approach. It is an algorithm for local sequence alignment. It is the same as before, but with a simple new idea: if the accumulated score goes negative, set it equal to zero. And start the traceback from the maximum score:

This optimization eliminates the noise of poorly matched segments. SSearch (here) is a commonly used implementation. It is still computationally expensive (O(n2). We will need something faster.