Friday, February 26, 2010

Jukes-Cantor (4)

If we think of evolutionary changes in sequences in a discrete fashion, then we could pick an α that would give, in one unit of evolutionary time, a 1% chance that nucleotide X => nucleotide Y. We can model this by matrix multiplication. We set up an initial Identity matrix and then process it to give the rate matrix T.

     A       C       G       T
A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

A starts out as equal to I, but repeated rounds of T•A yield the results below.

This can easily be modified to change the sequence model, although I haven't shown it here.

However, I notice that there is an inconsistency between this model and the simulation of mutation from here. The simulation showed saturation around 200% mutation, whereas this matrix method is clearly saturated at 100 rounds of 1%. I believe we may be missing a factor of 2, but I don't have any idea how that would be. I will think about it in days to come, but if you can spot the answer please let me know.

[ UPDATE: <whacks self on head> The previous simulation was for 1% mutation in each round, while this one has 3% in each round (1% for each nucleotide). That's the source of the difference! ]


0
A C G T
A 1.0 0.0 0.0 0.0
C 0.0 1.0 0.0 0.0
G 0.0 0.0 1.0 0.0
T 0.0 0.0 0.0 1.0

1
A C G T
A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

2
A C G T
A 0.9412 0.0196 0.0196 0.0196
C 0.0196 0.9412 0.0196 0.0196
G 0.0196 0.0196 0.9412 0.0196
T 0.0196 0.0196 0.0196 0.9412


import numpy as np
nt = 'ACGT'
def show(A,x):
print x
print ' ' + ' '.join(list(nt))
for i,e in enumerate(A):
print nt[i] + ' ',
L = [str(round(n,4)).ljust(7) for n in e]
print ' '.join(L)
print

I = np.eye(4)
A = I
T = I * 0.96 + 0.01
show(A,0)
for i in range(1,101):
A = np.dot(T,A)
R = range(2,10,2) + range(20,101,10)
if i in [1] + R:
show(A,i)


0
A C G T
A 1.0 0.0 0.0 0.0
C 0.0 1.0 0.0 0.0
G 0.0 0.0 1.0 0.0
T 0.0 0.0 0.0 1.0

1
A C G T
A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

2
A C G T
A 0.9412 0.0196 0.0196 0.0196
C 0.0196 0.9412 0.0196 0.0196
G 0.0196 0.0196 0.9412 0.0196
T 0.0196 0.0196 0.0196 0.9412

4
A C G T
A 0.887 0.0377 0.0377 0.0377
C 0.0377 0.887 0.0377 0.0377
G 0.0377 0.0377 0.887 0.0377
T 0.0377 0.0377 0.0377 0.887

6
A C G T
A 0.8371 0.0543 0.0543 0.0543
C 0.0543 0.8371 0.0543 0.0543
G 0.0543 0.0543 0.8371 0.0543
T 0.0543 0.0543 0.0543 0.8371

8
A C G T
A 0.791 0.0697 0.0697 0.0697
C 0.0697 0.791 0.0697 0.0697
G 0.0697 0.0697 0.791 0.0697
T 0.0697 0.0697 0.0697 0.791

20
A C G T
A 0.5815 0.1395 0.1395 0.1395
C 0.1395 0.5815 0.1395 0.1395
G 0.1395 0.1395 0.5815 0.1395
T 0.1395 0.1395 0.1395 0.5815

30
A C G T
A 0.4704 0.1765 0.1765 0.1765
C 0.1765 0.4704 0.1765 0.1765
G 0.1765 0.1765 0.4704 0.1765
T 0.1765 0.1765 0.1765 0.4704

40
A C G T
A 0.3965 0.2012 0.2012 0.2012
C 0.2012 0.3965 0.2012 0.2012
G 0.2012 0.2012 0.3965 0.2012
T 0.2012 0.2012 0.2012 0.3965

50
A C G T
A 0.3474 0.2175 0.2175 0.2175
C 0.2175 0.3474 0.2175 0.2175
G 0.2175 0.2175 0.3474 0.2175
T 0.2175 0.2175 0.2175 0.3474

60
A C G T
A 0.3148 0.2284 0.2284 0.2284
C 0.2284 0.3148 0.2284 0.2284
G 0.2284 0.2284 0.3148 0.2284
T 0.2284 0.2284 0.2284 0.3148

70
A C G T
A 0.2931 0.2356 0.2356 0.2356
C 0.2356 0.2931 0.2356 0.2356
G 0.2356 0.2356 0.2931 0.2356
T 0.2356 0.2356 0.2356 0.2931

80
A C G T
A 0.2786 0.2405 0.2405 0.2405
C 0.2405 0.2786 0.2405 0.2405
G 0.2405 0.2405 0.2786 0.2405
T 0.2405 0.2405 0.2405 0.2786

90
A C G T
A 0.269 0.2437 0.2437 0.2437
C 0.2437 0.269 0.2437 0.2437
G 0.2437 0.2437 0.269 0.2437
T 0.2437 0.2437 0.2437 0.269

100
A C G T
A 0.2627 0.2458 0.2458 0.2458
C 0.2458 0.2627 0.2458 0.2458
G 0.2458 0.2458 0.2627 0.2458
T 0.2458 0.2458 0.2458 0.2627