Python for Bioinformatics: Models of sequence evolution (1): a simulation

Wednesday, February 24, 2010

Models of sequence evolution (1): a simulation

I want to spend some time working on concepts in phylogenetics, and I'm going to start with models of sequence evolution. But before I do that I set up a simulation to visualize the broad outlines of what's going on. So, here is a basic simulation of sequence changes. The script sets up a sequence of length 1000 (50% GC) and then the sequence undergoes repeated rounds of mutation with a mutation rate of 1% for each round. At each step, we calculate the Hamming distance (the number of changes required to turn the evolved sequence back into the original), and convert it to a fraction of the total sequence length.

I plotted the results from 3 independent runs above. (They have been offset slightly for clarity).

The general shape of the curves shows saturation, as expected. Each plot is approximately linear at low levels of mutagenesis: ten cycles of mutagenesis at 1% per cycle gives a Hamming distance of ≈ 0.1. But by 50 cycles the distance is only about 0.35 and by 100 cycles it is about 0.55. This is a simple result of the fact that in any mutation event, sometimes the target has been mutated before, so it takes a long time before all the targets have been hit. Eventually, we do, and the Hamming distance peaks at 75% of the length, since in this example the model of sequence evolution is to change a nucleotide into one of the other three at equal rates.

The script can be modified to alter the initial GC content or the rates, although it hasn't been made particularly easy to do this. I did some runs with modified conditions and observe that: (i) a GC-content far from 50% is ameliorated quite quickly and (ii) even massively unbalanced rates (e.g. rD['T'] = 'ACG' + 'G' * 100) have modest effects on the nucleotide composition. That conflicts with what I think I remember of the theory, so I'll be interested to return to the issue after some reading.

Here's the code:

import random, sys
import numpy as np
import matplotlib.pyplot as plt
nt = 'ACGT'
random.seed(153)

def init(GC = 50):
    N = 5  # 1000 total bases
    seqL =   list('CG') * int(N * GC)
    seqL +=  list('AT') * int(N * (100-GC))
    random.shuffle(seqL)
    return seqL
    
def JC_rates():
    return { 'A':'CGT', 'C':'AGT',
             'G':'ACT', 'T':'ACG' }
    
def mutagenize(Lseq,mrate):
    rD = JC_rates()
    N = int(mrate / 100.0 * len(Lseq))
    X = len(Lseq)
    for i in range(N):
        j = random.choice(range(X))
        nt = Lseq[j]
        Lseq[j] = random.choice(rD[nt])
    
def hamming(s1,s2):
    L = [1 for c1,c2 in zip(s1,s2) if c1 != c2]
    return sum(L)*1.0/len(s1)

def report(seqL):
    pL = [n + ' ' + str(seqL.count(n)) for n in nt]
    return ', '.join(pL)
    
def simulate(seq,N):
    L = seq[:]
    mrate = 1
    Y = list()
    R = [5,10,50,100,200]
    for i in range(N):
        mutagenize(L,mrate)
        d = hamming(seq,L)
        Y.append(d)
        if i+1 in R:
            print str((i+1)*mrate).rjust(4) + '%  ',
            print str(d).ljust(6),
            print report(L)
    return Y
#-----------------------------------------------------

N = 250
seq = init()
print report(seq)
X = np.arange(1,N+1)
colors = ['red','dodgerblue','purple']
drawing = True

for j in range(3):
    Y = simulate(seq,N)
    if drawing:
        # plot at offset to see more clearly
        plt.scatter(X+j*5,Y,s=10,color=colors[j])

if not drawing:  sys.exit()
# dotted lines
line = plt.plot((0,225),(0.75,0.75),color='k',ls=':')
line = plt.plot((10,10),(0,0.8),color='k',ls=':')
line = plt.plot((50,50),(0,0.8),color='k',ls=':')
line = plt.plot((100,100),(0,0.8),color='k',ls=':')
ax = plt.axes()
ax.set_xlim(-2, 265)
ax.set_ylim(0, 0.8)
plt.savefig('example.png')