The Markov chain model we'll construct can be used to simulate DNA sequences having the properties built into the model. Now, imagine a genome not as a string of nucleotides, but consisting of individual positions that change through time. Each nucleotide in the sequence (or each amino acid residue in a polypeptide) can be though of as generated by a Markov chain through time. As far as I know, Margaret Dayhoff invented this idea. It is pervasive in phylogenetic analysis of sequences.

We have information in the form of a table of genomic dinucleotide frequencies. From this we can calculate the mononucleotide frequencies: freq('A') = sum(freq('AN')). The transition frequency is the probability of finding a particular base at position i+1, knowing the base at position i.

P(Xi is 'A' |

where the | means "given" (see Bayes rule). (And here is Bayes himself). The latter term is the transition probability. To calculate it, then, we divide the dinucleotide frequency by frequency of the first base.

Deonier have actually pre-computed the transition frequencies for

*M. genitalium*, which makes their R program quite a bit shorter than my Python version. It uses the "sample" function of R, which helpfully takes a probability distribution as one of the arguments. Here is a screenshot of my transcription of the code executed in R:

## No comments:

Post a Comment