Saturday, November 20, 2010

Likelihood of an evolutionary tree 2

As usual, to improve my understanding of the likelihood calculation, I wrote some code. The first module defines three functions for processing nodes, one for each combination of child nodes (both external, both internal, or mixed).

It's highly commented and should be self-explanatory after the previous post (here). To keep things simple, I have not implemented variable branch lengths yet.
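To picture the three cases, here is a minimal sketch that walks a small tree and names the case that applies at each internal node. The nested-tuple tree encoding and the helper names (is_leaf, node_kind, walk) are assumptions made for illustration; they are not part of the module.

```python
# Hypothetical tree encoding (not from the module):
# a leaf is a one-character string, an internal node is a (left, right) tuple.
def is_leaf(node):
    return isinstance(node, str)

def node_kind(node):
    """Name the case that applies to an internal node."""
    left, right = node
    n_leaves = is_leaf(left) + is_leaf(right)
    if n_leaves == 2:
        return 'ext_ext'   # both children external
    if n_leaves == 1:
        return 'int_ext'   # one internal, one external
    return 'int_int'       # both children internal

def walk(node, kinds=None):
    """Post-order list of cases: children are handled before parents."""
    if kinds is None:
        kinds = []
    if not is_leaf(node):
        for child in node:
            walk(child, kinds)
        kinds.append(node_kind(node))
    return kinds

tree = ((('A', 'C'), 'G'), ('T', 'T'))
print(walk(tree))
```

The post-order walk matters because each function needs the likelihoods of its internal children before it can process the parent.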

The problem I'm having is testing. The version that I'm actually developing has lots of statements like: if debug: print .., where I've been rigid about putting the print on an indented line below the if. (I've stripped all of those out in the listing below.) I went through the printout trying to verify the results of each calculation by hand, and it's pretty exhausting. It would be nice to have another ML program to test the output against, but without implementing the branch lengths properly, I'm not sure I can do that.
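One possible check that needs no second ML program is a brute-force sum over every assignment of states to the internal nodes of a tiny tree, compared against the node-by-node calculation. The sketch below assumes a four-tip tree ((a,b),(c,d)), the same fixed per-branch probabilities as the module, and a flat prior of 0.25. The function names (prob, cherry, join, pruning_loglik, brute_loglik) are hypothetical, and it works with plain probabilities rather than logs to stay short, so it is a compact restatement of the idea and not a test of the module itself.

```python
import math
from itertools import product

nt = 'ACGT'
pur, pyr = 'AG', 'CT'

def prob(m, n):
    # same fixed per-branch probabilities as the module, not in logs
    if m == n:
        return 0.95
    if (m in pur and n in pur) or (m in pyr and n in pyr):
        return 0.03
    return 0.01

def cherry(u, v):
    # likelihood of tips u, v for each parent state (the role of ext_ext)
    return [prob(n, u) * prob(n, v) for n in nt]

def join(L1, L2):
    # combine two internal children (the role of int_int), before pi;
    # the double sum over child states factors into two single sums
    out = []
    for n in nt:
        left = sum(prob(n, u) * f for u, f in zip(nt, L1))
        right = sum(prob(n, v) * f for v, f in zip(nt, L2))
        out.append(left * right)
    return out

def pruning_loglik(a, b, c, d):
    root = join(cherry(a, b), cherry(c, d))
    return math.log(sum(0.25 * x for x in root))

def brute_loglik(a, b, c, d):
    # sum over every assignment of states to root, left and right nodes
    total = 0.0
    for r, x, y in product(nt, repeat=3):
        total += (0.25 * prob(r, x) * prob(x, a) * prob(x, b)
                       * prob(r, y) * prob(y, c) * prob(y, d))
    return math.log(total)

for tips in ['AAAA', 'ACGT', 'AGCT']:
    print(tips, pruning_loglik(*tips), brute_loglik(*tips))
```

The two columns should agree to within rounding error. A second sanity check is that the likelihoods of all 256 possible tip patterns add up to 1, since each row of the probability table sums to 1. The same comparison could be run against ext_ext and int_int by exponentiating their results.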

This module itself is sort of boring. I wrote another module that exercises it a lot more, but I'm not quite ready to show that yet.

The code to strip out debug statements is short and sweet:

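# tu is a utility module and fn a file name, neither shown here
data = tu.load_data(fn)
data = data.strip().split('\n')

def lwspace(line):
    # number of leading whitespace characters
    return len(line) - len(line.lstrip())

flag = False       # True while inside an 'if debug:' block
target = 'debug:'
indent = 0         # indentation of the current 'if debug:' line
rL = list()        # lines that survive the stripping

for line in data:
    if target in line:
        assert 'if' in line
        flag = True
        indent = lwspace(line)
        print('*' + line)
        continue
    # a line at or left of the 'if' ends the debug block
    if lwspace(line) <= indent:
        flag = False
    if flag:
        print('*' + line)    # stripped
    else:
        print(' ' + line)    # kept
        rL.append(line)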

import math
# list printing
def printL(L,name=None):
    if name: print(name)
    pL = [str(round(n,3)).rjust(8) for n in L]
    N = 6
    while pL:
        line = pL[:N]
        pL = pL[N:]
        print(''.join(line))

# to start with, we'll assume equal branch lengths
# construct the transition-probability matrix
nt = 'ACGT'
pur = 'AG'
pyr = 'CT'
f_same = math.log(0.95)
f_transition = math.log(0.03)
f_transversion = math.log(0.01)
pi = math.log(0.25)

def get_f(n):
    m,n = list(n)
    if m == n: return f_same
    if m in pur and n in pur: return f_transition
    if m in pyr and n in pyr: return f_transition
    return f_transversion

k = [m+n for m in nt for n in nt]
v = [get_f(n) for n in k]
P = dict(zip(k,v))

# three different functions
# depending on types of child nodes

def ext_ext(u,v):
    # u,v are nucleotide as char
    # for each possible n in their parent node
    # returns an array of log(p) for nu and nv
    rL = [P[n+u] + P[n+v] for n in nt]
    return rL

# L = list of likelihoods in order, for internal child
# v = external child nucleotide as char
def int_ext(L,v):
    rL = list()
    # for new node in {ACGT}
    for i,m in enumerate(nt):
        ep = P[m+v]   # log(p) of m -> v
        sL = list()
        # each state for internal child
        # n is a float
        for j,n in enumerate(L):
            u = nt[j]
            # would multiply probs, so add logs
            p = P[m+u]   # log(p) for next branch
            p += L[j]    # log likelihood if that child = u
            p += ep      # log(p) for external child
            sL.append(p)

        # will need the actual probs to add them
        # does this need to be done better? how?
        sL = [math.exp(p) for p in sL]
        logS = math.log(sum(sL))
        rL.append(logS)
    return rL

# both children are internal
def int_int(L1,L2,root=False):
    rL = list()
    # for new node = {ACGT}
    for i,n in enumerate(nt):
        sL = list()
        # each state for left child
        # f1 is a float
        for j,f1 in enumerate(L1):
            u = nt[j]
            # each state for right child
            # f2 is a float
            for k,f2 in enumerate(L2):
                v = nt[k]
                p = P[n+u]    # log(p) for left branch
                p += f1       # log likelihood if child = u
                p += P[n+v]   # log(p) for right branch
                p += f2       # log likelihood if child = v
                sL.append(p)

        # will need the actual probs to add them
        sL = [math.exp(p) for p in sL]
        logS = math.log(sum(sL))
        rL.append(logS)
    if root:
        # do pi calculation
        rL = [e + pi for e in rL]
    return rL

if __name__ == '__main__':
    #for k in sorted(P.keys()): print k, P[k]
    #for u in nt:
    #for v in nt:
    #print '-'*40
    L = [-0.1] * 4
    print('-'*40)
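
On the question in the int_ext comments ("does this need to be done better? how?"): one standard answer is the log-sum-exp trick, which subtracts the largest log value before exponentiating so that the sum cannot underflow to zero. The sketch below is a possible drop-in for the exponentiate, sum, and log steps; the name log_sum_exp is my own and the module does not define it.

```python
import math

def log_sum_exp(logs):
    # log(sum(exp(x) for x in logs)), computed without underflow
    m = max(logs)
    if m == float('-inf'):
        # every term has probability zero
        return m
    # after subtracting m, the largest term is exp(0) = 1
    return m + math.log(sum(math.exp(x - m) for x in logs))

# with values this small, math.exp(x) underflows to 0.0 and
# math.log(sum(...)) would raise a ValueError
small = [-1000.0, -1001.0, -1002.0]
print(log_sum_exp(small))
```

This should not matter for a handful of nodes, but log likelihoods become large and negative quickly as a tree grows, and exponentiating values below roughly -745 gives exactly zero in double precision.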