data:image/s3,"s3://crabby-images/3272a/3272a46788bbf1768c440e3d3c05eac985960cee" alt=""
However, insertions and deletions (indels) don't happen one nucleotide at a time. Instead, it is thought that they tend to occur in the surface exposed loops of proteins, rather than in the core. Core regions are tightly packed and unlikely to tolerate much disruption. I need to look for papers about the above assertions, but let's continue. The idea of affine gap penalties is that there is a large hit for allowing a gap in the alignment, but a rather small penalty for extending one. We might have:
• gap opening penalty = -12
• gap extension penalty = -2
In a real problem, we would follow this approach. However, I want to keep it simple, so for this post, we are going to do as before, and have one penalty for each character (base or amino acid) opposite a gap (-8).
We want to align protein sequences this time: HEAGAWGHEE and PAWHEAE. We'll use one of the BLOSUM scoring matrices to do this (Henikoff & Henikoff, PMID: 1438297. Here, I have copied out the relevant pairwise scores from the BLOSUM50 matrix:
data:image/s3,"s3://crabby-images/24c7b/24c7b1f5ca7ad872c0ae64d98f7162acbaf6b22c" alt=""
We set up the table for the alignment scores exactly as before:
data:image/s3,"s3://crabby-images/4d38c/4d38cda94437314e5ecb47be77fc281b15e62127" alt=""
We fill it in using the same rules:
data:image/s3,"s3://crabby-images/5c721/5c7213b12f3b563fa0181736557c4f4d6ca63f90" alt=""
Having cached the arrows which show how we got to each square, we can do the traceback starting in the lower right-hand corner and work our way back:
data:image/s3,"s3://crabby-images/9d040/9d0405b770a4c9c2027bd8cd265222c0ac6eed4f" alt=""
The alignment is then:
data:image/s3,"s3://crabby-images/9d728/9d728b79cbccb1654d4f1b3ca86145d0b24bf808" alt=""
As before, notice that going up corresponds to a gap in sequence 1, while going left is a gap in sequence 2.
This particular application of NW is said to be due to Gotoh (PMID: 7166760). His paper in JMB is buried in the bowels of the central library about 2 miles from my desk. So I haven't actually read it. If you have a copy, I'd be grateful.
Smith-Waterman is a modification of this approach. It is an algorithm for local sequence alignment. It is the same as before, but with a simple new idea: if the accumulated score goes negative, set it equal to zero. And start the traceback from the maximum score:
data:image/s3,"s3://crabby-images/5fcf4/5fcf41b2a0129d0be371cf88e32cef9961cd945c" alt=""
This optimization eliminates the noise of poorly matched segments. SSearch (here) is a commonly used implementation. It is still computationally expensive (O(n2). We will need something faster.