## Saturday, November 20, 2010

### Likelihood of an evolutionary tree 2

As usual, to improve my understanding of the likelihood calculation, I wrote some code. The first module is called `likelihood.py`, and it defines three functions for processing nodes, depending on the nature of their child nodes (both external, both internal, or mixed).

It's highly commented and should be self-explanatory after the previous post (here). To keep things simple, I have not implemented variable branch lengths yet.

The problem I'm having is testing. The version that I'm actually developing has lots of statements like: `if debug: print ..`, where I've been rigid about putting the print on an indented line below the `if`. (I've stripped all of those out in the listing below). I went through the printout trying to verify the results of each calculation manually, and it's pretty exhausting. What would be nice is to have another ML program to test the output against, but without implementing the branch lengths properly, I'm not sure I can do that.

This module itself is sort of boring. I wrote another module that exercises it a lot more, but I'm not quite ready to show that yet.

The code to strip out debug statements is short and sweet:

 `data = tu.load_data(fn)data = data.strip().split('\n')def lwspace(line): return len(line) - len(line.lstrip())flag = Falsetarget = 'debug:'indent = 0rL = list() for line in data: if 'debug:' in line: assert 'if' in line flag = True indent = lwspace(line) print '*' + line continue if lwspace(line) <= indent: flag = False if flag: print '*' + line else: rL.append(line) print ' ' + line`

`likelihood.py`

 `import math# list printingdef printL(L,name=None): if name: print name pL = [str(round(n,3)).rjust(8) for n in L] N = 6 while pL: line = pL[:N] pL = pL[N:] print ''.join(line)# to start with, we'll assume equal branch lengths# construct the transition-probability matrixnt = 'ACGT'pur = 'AG'pyr = 'CT'f_same = math.log(0.95)f_transition = math.log(0.03)f_transversion = math.log(0.01)pi = math.log(0.25)def get_f(n): m,n = list(n) if m == n: return f_same if m in pur and n in pur: return f_transition if m in pyr and n in pyr: return f_transition return f_transversionk = [m+n for m in nt for n in nt]v = [get_f(n) for n in k]P = dict(zip(k,v))#---------------------------------------# three different functions # depending on types of child nodes def ext_ext(u,v): # u,v are nucleotide as char # for each possible n in their parent node # returns an array of log(p) for nu and nv rL = [P[n+u] + P[n+v] for n in nt] return rL# L = list of likelihoods in order, for internal child# v = external child nucleotide as chardef int_ext(L,v): rL = list() # for new node in {ACGT} for i,m in enumerate(nt): ep = P[m+v] # log(p) of m -> v sL = list() # each state for internal child # n is a float for j,n in enumerate(L): u = nt[j] # would multiply probs, so add logs p = P[m+u] # log(p) for next branch p += L[j] # log likelihood if that child = u p += ep # log(p) for external child sL.append(p) # will need the actual probs to add them # does this need to be done better? how? sL = [(math.e)**p for p in sL] logS = math.log(sum(sL)) rL.append(logS) return rL # both children are internaldef int_int(L1,L2,root=False): rL = list() # for new node = {ACGT} for i,n in enumerate(nt): sL = list() # each state for left child # v1 is a float for j,f1 in enumerate(L1): u = nt[j] # each state for right child # v2 is a float for k,f2 in enumerate(L2): v = nt[k] p = P[n+u] # log(p) for left branch p += f1 # log likelihood if child = u p += P[n+v] # log(p) for right branch p += f2 # log likelihood if child = v sL.append(p) # will need the actual probs to add them sL = [(math.e)**p for p in sL] logS = math.log(sum(sL)) rL.append(math.log(sum(sL))) if root: # do pi calculation rL = [e + pi for e in rL] return rL if __name__ == '__main__': #for k in sorted(P.keys()): print k, P[k] #for u in nt: #for v in nt: #ext_ext(u,v) #print '-'*40 L = [-0.1] * 4 int_ext(L,'C') print '-'*40 int_int(L,L)`