Python for Bioinformatics: Hidden Markov Models (5)

Saturday, November 14, 2009

Hidden Markov Models (5)

Building on the previous post in this series, we have an HMM with two states, G (good) and B (bad) weather, each with characteristic emission and transition probabilities. Let's add a begin state to the model, with P = 0.5 for each possible transition.
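To make the setup concrete, here is a minimal sketch of how such a model might be written down in Python. Only the begin-state value of 0.5 comes from the text above. The transition and emission numbers are placeholders I made up for illustration (they are not the values from the earlier post), and the names start_p, trans_p, emit_p and rows_sum_to_one are my own.

```python
# Hypothetical parameters: placeholders, NOT the values from the earlier post.
states = ("G", "B")

# Begin state: P = 0.5 for each possible transition (this one is from the text).
start_p = {"G": 0.5, "B": 0.5}

# trans_p[i][j] = P(next state is j | current state is i)  (assumed values)
trans_p = {
    "G": {"G": 0.6, "B": 0.4},
    "B": {"G": 0.5, "B": 0.5},
}

# emit_p[s][o] = P(observing o | state is s)  (assumed values)
emit_p = {
    "G": {"Sun": 0.8, "Rain": 0.2},
    "B": {"Sun": 0.3, "Rain": 0.7},
}


def rows_sum_to_one(table, tol=1e-9):
    """Return True if every row of a nested dict sums to 1 within tol."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())


# Sanity check: each state's outgoing probabilities should add up to 1.
print(rows_sum_to_one(trans_p), rows_sum_to_one(emit_p))
```

Plain dictionaries keep the lookups readable (trans_p["G"]["B"] is the G -> B transition), which is handy when following the table by hand.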

Now, use the Viterbi algorithm to compute the most probable sequence of hidden states corresponding to some observed data. Suppose the observations are Sun-Rain-Sun. We begin with Sun. Make a table with columns for the observations and rows for the states. For each row of the first column, we compute the product of the transition probability from the begin state and the emission probability of Sun from that row's state. This product is proportional to the probability of being in that state, but it is not a real probability since the values in the column don't add up to 1 (these are likelihoods).
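As a small worked example of the first column, the sketch below multiplies the begin-state transition (0.5, from the text) by an emission probability for Sun. The emission values are placeholders chosen for illustration, not the ones from the earlier post. It also shows why the entries are likelihoods: they do not sum to 1 until they are divided by their total.

```python
# Begin-state transitions are from the text; emission values are assumed.
start_p = {"G": 0.5, "B": 0.5}
emit_sun = {"G": 0.8, "B": 0.3}  # hypothetical P(Sun | state)

# First column of the table: transition from begin times emission of Sun.
first_column = {s: start_p[s] * emit_sun[s] for s in start_p}

# The column total is not 1, so these are likelihoods, not probabilities.
total = sum(first_column.values())

# Dividing by the total turns them into proper probabilities.
normalized = {s: v / total for s, v in first_column.items()}

print(first_column, total, normalized)
```

Viterbi itself never needs the normalized values, since only the comparison between candidates matters.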

Now, we add the second observation, Rain, in the second column.

Work from left to right, filling each column from the one before it. For the G state there are two possible paths to get there, either from the previous G or from the previous B. First compute the transition probability for B -> G times the previous cumulative probability for state B. Then compute the transition probability for G -> G times the previous cumulative probability for state G. Compare the two and take the maximum (here, G -> G), then multiply it by the emission probability (Rain from G) and put the result at the bottom of the box.

Draw a horizontal arrow to remember how we got here (the traceback). Continue with the other possible state for this observation, in the second row, second column.

Notice the correspondence with the first row: the transition probabilities in matching terms of the two rows add to 1 (for example, G -> G and G -> B), while the cumulative probabilities being carried forward are unchanged. In this case, the maximum value corresponds to G -> B, so we draw a diagonal arrow.
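The step just described for the second column can be sketched as a small function. This is my own illustration, not code from the series: the name viterbi_step and all transition and emission numbers are assumptions, chosen so that the maxima fall on G -> G and G -> B as in the walk-through.

```python
def viterbi_step(prev_column, obs, states, trans_p, emit_p):
    """Fill one new column of the table from the previous column.

    Returns (new_column, back), where back[s] is the previous state
    that the arrow from s points to (the traceback).
    """
    new_column, back = {}, {}
    for s in states:
        # Compare the candidate paths into s and keep the maximum.
        best_prev = max(states, key=lambda p: prev_column[p] * trans_p[p][s])
        # Multiply the winner by the emission probability for this observation.
        new_column[s] = prev_column[best_prev] * trans_p[best_prev][s] * emit_p[s][obs]
        back[s] = best_prev
    return new_column, back


# Hypothetical parameters (placeholders, not the values from the earlier post).
states = ("G", "B")
trans_p = {"G": {"G": 0.6, "B": 0.4}, "B": {"G": 0.5, "B": 0.5}}
emit_p = {"G": {"Sun": 0.8, "Rain": 0.2}, "B": {"Sun": 0.3, "Rain": 0.7}}

# First column for Sun under these assumptions: 0.5 * emission of Sun.
first_column = {"G": 0.4, "B": 0.15}

second_column, arrows = viterbi_step(first_column, "Rain", states, trans_p, emit_p)
print(second_column, arrows)
```

With these placeholder numbers both arrows point back to G: a horizontal arrow for the G row and a diagonal one for the B row.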

And so on:

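Now, look at the right-hand column and pick the state with the highest cumulative likelihood. Follow the traceback from there to figure out how we got here. In the table as drawn, this gives GBB as the most probable sequence of hidden states, but see the correction below: the right answer is GBG.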
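Putting the pieces together, here is a minimal sketch of the whole procedure, including the traceback. The function name viterbi and every transition and emission value are my own placeholders (only the begin-state 0.5 comes from the text), so the output says nothing about whether the arithmetic in the slides is right.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its likelihood)."""
    # First column: transition from the begin state times the emission.
    column = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    backpointers = []  # one dict per later column: state -> best previous state

    for obs in observations[1:]:
        new_column, back = {}, {}
        for s in states:
            # Best way into s from the previous column.
            best_prev = max(states, key=lambda p: column[p] * trans_p[p][s])
            new_column[s] = column[best_prev] * trans_p[best_prev][s] * emit_p[s][obs]
            back[s] = best_prev
        backpointers.append(back)
        column = new_column

    # Pick the best state in the right-hand column, then follow the arrows back.
    last = max(states, key=lambda s: column[s])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, column[last]


# Hypothetical parameters (placeholders, not the values from the earlier post).
states = ("G", "B")
start_p = {"G": 0.5, "B": 0.5}
trans_p = {"G": {"G": 0.6, "B": 0.4}, "B": {"G": 0.5, "B": 0.5}}
emit_p = {"G": {"Sun": 0.8, "Rain": 0.2}, "B": {"Sun": 0.3, "Rain": 0.7}}

path, likelihood = viterbi(["Sun", "Rain", "Sun"], states, start_p, trans_p, emit_p)
print("".join(path), likelihood)
```

With these made-up numbers the path happens to come out as GBG. For long observation sequences the products become very small, so a practical version would usually add log probabilities instead of multiplying.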

[ A sharp-eyed reader spotted two math errors that alter the results in the last part and change the most probable sequence to GBG. Sorry for the confusion. I no longer have the slides in a form I can edit easily, so I'll leave them as they are. ]