Friday, November 13, 2009

Hidden Markov Models (3)

This post is about the "occasionally dishonest casino" from Durbin et al., Biological Sequence Analysis. To save effort, I'll just use my slide from class:

We're going to model this system in Python, starting with two sets of dice with the requisite probabilities, and extending into HMM analysis. Here is typical output for 100 rolls of the dice. We show the state above, and the output below:



Following, in segments, is the code for the Model class.

import random
outcomes = '123456'

class Model:
    def __init__(self, pFL, pLF):
        self.emit = list(outcomes)
        self.fairProb = [1.0/6] * 6
        self.loadProb = [0.1] * 5 + [0.5]
        self.tranProb = { 'F':pFL,'L':pLF }
        self.setState()  # start in a random state

    def setState(self,which=None):
        if which: self.state = which
        else: self.state = random.choice('FL')

The initialization function is kept simple. It requires values for the transition probabilities to switch states in either direction. The emission probabilities are hard-coded. The state is set at the beginning of a run. Rolling the dice (below) is straightforward. We test the result of random.random() against the cumulative probability and emit the corresponding outcome.
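The cumulative-probability trick can be checked in isolation. Here is a minimal sketch; the helper name `sample` and the counting demo are mine, not part of the post's code:

```python
import random
random.seed(42)  # fixed seed so the demo is repeatable

def sample(probs, outcomes):
    # Walk the cumulative distribution until it passes a uniform draw.
    f = random.random()
    S = 0.0
    for p, o in zip(probs, outcomes):
        S += p
        if f < S:
            return o
    return outcomes[-1]  # guard against floating-point shortfall

# The loaded die should emit '6' about half the time.
loadProb = [0.1] * 5 + [0.5]
counts = {o: 0 for o in '123456'}
for _ in range(10000):
    counts[sample(loadProb, '123456')] += 1
```

Over 10000 draws, the count for '6' lands near 5000 and each of '1' through '5' near 1000, matching the loaded-die probabilities.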

    def rollDice(self):
        if self.state == 'F': L = self.fairProb
        else: L = self.loadProb
        f = random.random()
        S = 0
        for i in range(len(L)):
            S += L[i]
            if f < S: break
        return self.emit[i]

    def transit(self):
        def switch():
            if self.state == 'F': self.state = 'L'
            else: self.state = 'F'
        p = self.tranProb[self.state]
        f = random.random()
        if f < p: switch()

random.random() is used a second time to decide whether to switch on a particular transition. The object can generate sequences of as many states as needed, with a default of 50.
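The switching step can likewise be checked on its own: applying the rule repeatedly from a fixed state, the fraction of switches should approach the transition probability. A sketch, where the standalone `transit` is my paraphrase of the method above:

```python
import random
random.seed(1)  # fixed seed for repeatability

tranProb = {'F': 0.1, 'L': 0.2}

def transit(state):
    # Switch with probability tranProb[state]; otherwise stay put.
    if random.random() < tranProb[state]:
        return 'L' if state == 'F' else 'F'
    return state

# Out of 10000 independent trials from the fair state,
# about 10% should switch to the loaded state.
switches = sum(1 for _ in range(10000) if transit('F') == 'L')
rate = switches / 10000.0
```

With pFL = 0.1, the observed rate comes out close to 0.1, as expected.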

    def sequence(self,N=50):
        rolls = list()
        states = list()
        for i in range(N):
            states.append(self.state)
            rolls.append(self.rollDice())
            self.transit()
        return ''.join(rolls),''.join(states)

if __name__ == '__main__':
    m = Model(pFL=0.1,pLF=0.2)
    L = [m.sequence() for i in range(2)]
    for r,s in L: print s + '\n' + r + '\n'

import diceStats as St

It is useful to check that the model performs as expected. The code for this is in another module, which we import here. But first, to motivate us, here is typical output from these checks:

1   F 0.17    L 0.097
2   F 0.166   L 0.107
3   F 0.164   L 0.099
4   F 0.162   L 0.097
5   F 0.174   L 0.096
6   F 0.164   L 0.505

F 6748 0.098
L 3252 0.204

The outcomes 1 through 6 are emitted with approximately the correct probabilities. The last two lines list the number of times out of 10000 steps that the model was in the fair or loaded state, followed by the observed switching frequency from each state. The ratio of the state counts is set by the ratio of the transition probabilities, and the observed switching frequencies (0.098 and 0.204) closely match the values used in the run (0.1 and 0.2).
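The split between the two states follows from the stationary distribution of a two-state Markov chain: at equilibrium, the flow F to L must balance the flow L to F, so pi_F * pFL = pi_L * pLF, giving pi_F = pLF / (pFL + pLF). A quick check with the probabilities used above:

```python
pFL, pLF = 0.1, 0.2   # transition probabilities from the run above

piF = pLF / (pFL + pLF)   # expected fraction of steps in the fair state
piL = pFL / (pFL + pLF)   # expected fraction in the loaded state
# piF = 2/3 and piL = 1/3, in line with the observed 6748 and 3252 out of 10000
```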

Here is the diceStats module:

outcomes = '123456'

def checkEStats(m):
    N = 10000
    m.state = 'F'
    fairL = [m.rollDice() for i in range(N)]
    m.state = 'L'
    loadL = [m.rollDice() for i in range(N)]
    for o in outcomes:
        print o, ' F',
        print str(round(fairL.count(o) * 1.0 / N, 3)).ljust(5),
        print ' L',
        print str(round(loadL.count(o) * 1.0 / N, 3)).ljust(5)

def checkTStats(m):
    N = 10000
    L = list()
    m.setState()
    for i in range(N):
        L.append(m.state)
        m.transit()
    totalF = L.count('F')
    totalL = L.count('L')
    D = { 'F':0,'L':0 }
    c1 = L[0]
    for c2 in L[1:]:
        if c1 != c2: D[c1] += 1
        c1 = c2

    print 'F', totalF,
    print str(round(D['F'] * 1.0 / totalF, 3)).ljust(5)
    print 'L', totalL,
    print str(round(D['L'] * 1.0 / totalL, 3)).ljust(5)
