Saturday, August 8, 2009

Alignment: Gotoh, and Smith-Waterman

In sequence alignment, we want to penalize gaps heavily so that we don't get this:



However, insertions and deletions (indels) don't happen one nucleotide at a time. Instead, it is thought that they tend to occur in the surface exposed loops of proteins, rather than in the core. Core regions are tightly packed and unlikely to tolerate much disruption. I need to look for papers about the above assertions, but let's continue. The idea of affine gap penalties is that there is a large hit for allowing a gap in the alignment, but a rather small penalty for extending one. We might have:

• gap opening penalty = -12
• gap extension penalty = -2

In a real problem, we would follow this approach. However, I want to keep it simple, so for this post, we are going to do as before, and have one penalty for each character (base or amino acid) opposite a gap (-8).

We want to align protein sequences this time: HEAGAWGHEE and PAWHEAE. We'll use one of the BLOSUM scoring matrices to do this (Henikoff & Henikoff, PMID: 1438297. Here, I have copied out the relevant pairwise scores from the BLOSUM50 matrix:



We set up the table for the alignment scores exactly as before:



We fill it in using the same rules:



Having cached the arrows which show how we got to each square, we can do the traceback starting in the lower right-hand corner and work our way back:



The alignment is then:



As before, notice that going up corresponds to a gap in sequence 1, while going left is a gap in sequence 2.

This particular application of NW is said to be due to Gotoh (PMID: 7166760). His paper in JMB is buried in the bowels of the central library about 2 miles from my desk. So I haven't actually read it. If you have a copy, I'd be grateful.

Smith-Waterman is a modification of this approach. It is an algorithm for local sequence alignment. It is the same as before, but with a simple new idea: if the accumulated score goes negative, set it equal to zero. And start the traceback from the maximum score:



This optimization eliminates the noise of poorly matched segments. SSearch (here) is a commonly used implementation. It is still computationally expensive (O(n2). We will need something faster.

3 comments:

joe said...

Hi Tom,

What timing! I just read your recent blog entry about the Smith-Waterman algorithm, and found it very helpful, as I'm learning Python and interested in bioinformatics.

I've just written a Python3 S-W implementation. It reads 2 FASTA files and determines the score (just linear gaps, without Gotoh's modification so far). Then it writes the alignments, symbol line and some statistics to a file.

To get how the algorithm works, I used a small tutorial on dynamic programming by an IBM employee named Paul Reiners at
http://www.ibm.com/developerworks/java/library/j-seqalign/index.html.

His examples were in Java, which is impenetrable to me. But his step-by-step explanation is, like yours, very clear. Actually yours is clearer. I got the rest from Smith and Waterman's article:

Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197.

It's actually understandable by a non-mathematician. If you haven't seen it, you can access it from the references in the S-W article in Wikipedia. (Unfortunately, the Gotoh article from '82 doesn't seem to be accessible without a fee over the internet.)

I wonder if you know of a python (2 or 3) implementation of SW-Gotoh?

If you'd like to see my source code, I'd be very happy to email it. It's all open source, so you could do whatever you wanted with it.

Thanks for sharing your experience on the blog - though much of it is over my head.

Joe Oettinger

PS I'm just doing this stuff out of intellectual curiosity. I'm a retired Ob-Gyn. I read your 'aboutme' on the West VA University website, and I think I can top your early Apple experience - I bought a TRS-80 with 4 kb in '78 or '79. Now I'm using Ubuntu on an HP pavilion laptop. I got the laptop from my son-in-law for free because it was terminally virus-infected.

telliott99 said...

Hi Joe,

Thanks for your comments. I have a copy of Gotoh that a reader sent me. Shoot me an email if you want it.

I've been distracted with Cocoa stuff recently, but hope to get back to what I'm supposed to be doing soon.

Tom

adrovane said...

Hi.

I'm too looking for Python Smith-Waterman and Gotoh implementations. I would be very grateful if you could sent it to me.

Regards,
Adrovane.