Python for Bioinformatics: Jukes-Cantor (4)

Friday, February 26, 2010

Jukes-Cantor (4)

If we think of evolutionary changes in sequences in a discrete fashion, then we could pick an α that would give, in one unit of evolutionary time, a 1% chance that nucleotide X => nucleotide Y. We can model this by matrix multiplication. We set up an initial Identity matrix and then process it to give the rate matrix T.

     A       C       G       T
A  0.97    0.01    0.01    0.01   
C  0.01    0.97    0.01    0.01   
G  0.01    0.01    0.97    0.01   
T  0.01    0.01    0.01    0.97

A starts out as equal to I, but repeated rounds of T•A yield the results below.

This can easily be modified to change the sequence model, although I haven't shown it here.

However, I notice that there is an inconsistency between this model and the simulation of mutation from here. The simulation showed saturation around 200% mutation, whereas this matrix method is clearly saturated at 100 rounds of 1%. I believe we may be missing a factor of 2, but I don't have any idea how that would be. I will think about it in days to come, but if you can spot the answer please let me know.

[ UPDATE: <whacks self on head> The previous simulation was for 1% mutation in each round, while this one has 3% in each round (1% for each nucleotide). That's the source of the difference! ]

0
     A       C       G       T
A  1.0     0.0     0.0     0.0    
C  0.0     1.0     0.0     0.0    
G  0.0     0.0     1.0     0.0    
T  0.0     0.0     0.0     1.0    

1
     A       C       G       T
A  0.97    0.01    0.01    0.01   
C  0.01    0.97    0.01    0.01   
G  0.01    0.01    0.97    0.01   
T  0.01    0.01    0.01    0.97   

2
     A       C       G       T
A  0.9412  0.0196  0.0196  0.0196 
C  0.0196  0.9412  0.0196  0.0196 
G  0.0196  0.0196  0.9412  0.0196 
T  0.0196  0.0196  0.0196  0.9412

import numpy as np
nt = 'ACGT'
def show(A,x):
    print x
    print '     ' + '       '.join(list(nt))
    for i,e in enumerate(A):
        print nt[i] + ' ',
        L = [str(round(n,4)).ljust(7) for n in e]
        print ' '.join(L)
    print 

I = np.eye(4)
A = I
T = I * 0.96 + 0.01
show(A,0)
for i in range(1,101):
    A = np.dot(T,A)
    R = range(2,10,2) + range(20,101,10)
    if i in [1] + R:
        show(A,i)

0
     A       C       G       T
A  1.0     0.0     0.0     0.0    
C  0.0     1.0     0.0     0.0    
G  0.0     0.0     1.0     0.0    
T  0.0     0.0     0.0     1.0    

1
     A       C       G       T
A  0.97    0.01    0.01    0.01   
C  0.01    0.97    0.01    0.01   
G  0.01    0.01    0.97    0.01   
T  0.01    0.01    0.01    0.97   

2
     A       C       G       T
A  0.9412  0.0196  0.0196  0.0196 
C  0.0196  0.9412  0.0196  0.0196 
G  0.0196  0.0196  0.9412  0.0196 
T  0.0196  0.0196  0.0196  0.9412 

4
     A       C       G       T
A  0.887   0.0377  0.0377  0.0377 
C  0.0377  0.887   0.0377  0.0377 
G  0.0377  0.0377  0.887   0.0377 
T  0.0377  0.0377  0.0377  0.887  

6
     A       C       G       T
A  0.8371  0.0543  0.0543  0.0543 
C  0.0543  0.8371  0.0543  0.0543 
G  0.0543  0.0543  0.8371  0.0543 
T  0.0543  0.0543  0.0543  0.8371 

8
     A       C       G       T
A  0.791   0.0697  0.0697  0.0697 
C  0.0697  0.791   0.0697  0.0697 
G  0.0697  0.0697  0.791   0.0697 
T  0.0697  0.0697  0.0697  0.791  

20
     A       C       G       T
A  0.5815  0.1395  0.1395  0.1395 
C  0.1395  0.5815  0.1395  0.1395 
G  0.1395  0.1395  0.5815  0.1395 
T  0.1395  0.1395  0.1395  0.5815 

30
     A       C       G       T
A  0.4704  0.1765  0.1765  0.1765 
C  0.1765  0.4704  0.1765  0.1765 
G  0.1765  0.1765  0.4704  0.1765 
T  0.1765  0.1765  0.1765  0.4704 

40
     A       C       G       T
A  0.3965  0.2012  0.2012  0.2012 
C  0.2012  0.3965  0.2012  0.2012 
G  0.2012  0.2012  0.3965  0.2012 
T  0.2012  0.2012  0.2012  0.3965 

50
     A       C       G       T
A  0.3474  0.2175  0.2175  0.2175 
C  0.2175  0.3474  0.2175  0.2175 
G  0.2175  0.2175  0.3474  0.2175 
T  0.2175  0.2175  0.2175  0.3474 

60
     A       C       G       T
A  0.3148  0.2284  0.2284  0.2284 
C  0.2284  0.3148  0.2284  0.2284 
G  0.2284  0.2284  0.3148  0.2284 
T  0.2284  0.2284  0.2284  0.3148 

70
     A       C       G       T
A  0.2931  0.2356  0.2356  0.2356 
C  0.2356  0.2931  0.2356  0.2356 
G  0.2356  0.2356  0.2931  0.2356 
T  0.2356  0.2356  0.2356  0.2931 

80
     A       C       G       T
A  0.2786  0.2405  0.2405  0.2405 
C  0.2405  0.2786  0.2405  0.2405 
G  0.2405  0.2405  0.2786  0.2405 
T  0.2405  0.2405  0.2405  0.2786 

90
     A       C       G       T
A  0.269   0.2437  0.2437  0.2437 
C  0.2437  0.269   0.2437  0.2437 
G  0.2437  0.2437  0.269   0.2437 
T  0.2437  0.2437  0.2437  0.269  

100
     A       C       G       T
A  0.2627  0.2458  0.2458  0.2458 
C  0.2458  0.2627  0.2458  0.2458 
G  0.2458  0.2458  0.2627  0.2458 
T  0.2458  0.2458  0.2458  0.2627