Wednesday, October 28, 2009

Hidden Markov Models (1)




I want to start a series of posts about Hidden Markov Models or HMMs. In the spirit of the blog, these will be reports from someone who is a biologist by training, who struggled a bit with the mathematical ideas, and then found his way to a basic understanding. Nothing fancy, just an explanation that makes sense to me. And some Python code to go with it.

The best introductions that I know of are (unfortunately not free):

• The first 80 pages of Durbin
• A short article by Sean Eddy (PMID 15470472)
• A series of Dave Swofford lectures that got disappeared off the web
• A chapter by Anders Krogh in this book

Before we get into HMMs let's consider an example from Krogh explaining about Position-Specific Scoring Matrices or PSSMs. Here is a set of 5 aligned sequences:



Assume we have evidence that these are sites with the same biological function (binding data, mutations, or whatever). We want to find more examples in new sequence that we would predict have the same function.

We could use regular expressions to search for this motif. Krogh uses this example of a regular expression to match either C. elegans or Caenorhabditis elegans

"C[\.a-z]* elegans"

The notation formalizes the logic that we require first capital C, then some stuff, then a space, followed by elegans. The stuff in the middle is either a period or any lower case letter from a to z, of any length (that's the * wild-card). The backslash is just to remind ourselves that we really do mean a period.

For the sequence example, we would use [AT] [CG] [AC] [ACGT]* A [TG] [GC].

But, one problem with this approach is that we haven't used the frequency information. So, for example, our RE does not distinguish the highly implausible sequence TGCT--AGG from the consensus sequence ACAC--ATC.

Here is a PSSM for the first three positions in the sequence alignment:



We could do something similar with the last three. But a PSSM has no way simple way to deal with the variable spacing. Particularly if there are multiple variable spaces. This leads us to a simple HMM:



We proceed through a series of states which are considered to "emit" nucleotides with different frequencies, resulting in a probability for emission of any particular sequence. The states may have transition probabilities as well. In more complicated models a given sequence may be emitted by more than one series of states.

Using the basic model, we can calculate expected frequencies for different sequences. For example, the probability of the sequence ACAC--AGC is

0.8 x 0.8 x 0.8 x 0.6 x 0.4 x 0.6 x 1.0 x 0.2 x 0.8 = 0.0118

while for the second sequence TGCT--AGG we get

0.2 x 0.2 x 0.2 x 0.6 x 0.2 x 0.6 x 1.0 x 0.2 x 0.2 = 0.000023

It's probability is (satisfyingly) 500 times smaller than the first one. As you know, multiplication is tedious for us (and slow as well as error-prone for very small numbers, even on a computer). So typically the model's probabilities are specified as logarithms. And we would usually compare the probabilities according to the model with a second model based on random nucleotides at the frequencies observed in the data, although when comparing two different sequences such normalization is factored out.

You might have noticed that it was really lucky :) that both sequences had an A at the position 2 from the end. Otherwise, we'd have multiplied by zero for zero total probability. It doesn't make biological sense to do discard an otherwise excellent match for this reason. It is probably a consequence of having a limited set of authentic sequences. We say that the model is over-fitted to the data, and adjust the data using pseudocounts. So, every nucleotide that had zero probability (because it was never observed in the small set of authentic sequences) might receive a pseudo-count of 1 (or 0.1 or whatever the modeler decides is reasonable).

As an aside, PSSMs work reasonably well for many problems. I've used them a lot, and maybe I'll write about that another time.