Python for Bioinformatics: Alignment: Needleman-Wunsch

Saturday, August 8, 2009

Alignment: Needleman-Wunsch

Sergey called and said that if I don't do any bioinformatics examples they are going to make me change the name of the blog. :)

Here is a first installment.

The Needleman-Wunsch algorithm (PMID: 5420325---if this reference doesn't make any sense to you, just enter that number into the search box on this page), carries out a global alignment on two sequences. The wikipedia article on this topic is not very clear, in my opinion. The NW method produces a global rather than a local alignment. It is guaranteed to be the best possible (highest scoring) alignment. It is not guaranteed to be biologically relevant.

I found a nice explanation in Dave Swofford's lectures. Maybe you can find them, but his employer FSU has taken them down, apparently. That's too bad, because he helped me understand lots of things, most important, the methods for Hidden Markov Models, like the Viterbi algorithm, etc., which I hope to get to in a future post.

[In a perfect world, we would all put our lecture notes on line. But there are problems: our employers seem to think they "own" these presentations. And then there are people like me, who shamelessly use graphics snatched off the web to illustrate our slides. Due to stupid copyright issues, I can't put them up for you to see.]

There is a nice page about NW here. The example I will use is from Sean Eddy (PMID 15229554). He has several other very nice tutorials as well.

Let's say we want to align these two sequences: TGCTCGTA and TTCATA. We need scoring rules to evaluate each position in the alignment. Suppose we take:

• match = +5
• mismatch = -2
• insertion = -6

Notice this is an empirical procedure. We choose parameters which make the ultimate solutions seem correct. Set up a table like this:

At each turn, we move either down, right, or both right and down (on the diagonal). We start with an extra column at the left and an extra row at the top. Fill out these extra columns as shown. Moving right corresponds to an alignment with a gap in the sequence TTCATA, while moving down corresponds to a accepting a gap in the other sequence. Each additional base opposite a gap accumulates a penalty of -6 in our scoring system.

We calculate a running score for each position. In the top row, there is only one way to get to a square (from the left). Hence we add an additional -6 to the score for each move across the top. Similar logic applies for the left column.

Having finished the preliminaries, the first interesting position is the square where the initial nucleotide of each sequence lines up. We make the following calculation. We can do one of three things:

• move down, add -6 to the score from the square above
• move right, add -6 to the score from the square to the left
• move diagonally, and take the score from the upper left
in this case, evaluate the match or mismatch to find the score

We calculate these values, and take the maximum value. The mathematicians have a fancy way of writing that:

(These graphics are from Durbin et al, another very fine book).

We write the result of our calculation into the square, and also write down an arrow that shows which path we took.

In this case, we got +5 for the match, while the potential moves down or right had scores of -12. We continue to fill in the table. Here, I show the result, but I have not shown all of the arrows.

The last step is to begin at the right hand bottom corner, and follow the arrows in reverse back to the beginning. This is called the traceback. The resulting alignment is:

Notice the correspondence of the gaps in the top sequence with horizontal arrows in the table.

NW is a dynamic programming algorithm. We:

• break a big problem into smaller problems
• these have solutions that can be built up in stages
• keep track of (memoize) solutions to the small problems

• gives an alignment which is guaranteed to be the "best"
• will happily align unrelated sequences
• is computationally expensive: O(n²)