Python for Bioinformatics: Elementary Markov Chain

Saturday, August 9, 2008

Elementary Markov Chain

In this section (Deonier Ch. 2 example 3) the Markov chain is introduced using the example of dinucleotide frequencies. (Here is Markov himself). In scanning a genome we might ask, given that the nucleotide at a certain position is 'A', what is the probability that the next nucleotide in the sequence is 'G'? We make the important (and obviously unrealistic) assumption that the next base depends only on the identity of the current one.

The Markov chain model we'll construct can be used to simulate DNA sequences having the properties built into the model. Now, imagine a genome not as a string of nucleotides, but consisting of individual positions that change through time. Each nucleotide in the sequence (or each amino acid residue in a polypeptide) can be though of as generated by a Markov chain through time. As far as I know, Margaret Dayhoff invented this idea. It is pervasive in phylogenetic analysis of sequences.

We have information in the form of a table of genomic dinucleotide frequencies. From this we can calculate the mononucleotide frequencies: freq('A') = sum(freq('AN')). The transition frequency is the probability of finding a particular base at position i+1, knowing the base at position i.

P(Xi is 'A' and Xi+1 is 'G') = 
    P(Xi is 'A') * P(Xi+1 is 'G' | Xi is 'A')

where the | means "given" (see Bayes rule). (And here is Bayes himself). The latter term is the transition probability. To calculate it, then, we divide the dinucleotide frequency by frequency of the first base.

Deonier have actually pre-computed the transition frequencies for M. genitalium, which makes their R program quite a bit shorter than my Python version. It uses the "sample" function of R, which helpfully takes a probability distribution as one of the arguments. Here is a screenshot of my transcription of the code executed in R: