Thursday, July 23, 2009

Phylogenetic Trees: node positions

In this post, we'll continue working with the tree from last time. The data we recovered for the tree looks like this:

external nodes:
E1 Stenotrophomonas_maltophilia 0.07574
E2 Kingella_oralis 0.08026
E3 Pseudomonas_aeruginosa 0.05950
E4 Salmonella_typhi 0.01297
E5 Escherichia_coli 0.01491
E6 Haemophilus_parainfluenzae 0.06113

((E1,E2):0.00827,E3,((E4,E5):0.03356,E6):0.03863);

internal nodes:
I1 E1 E2 0.00827
I2 E4 E5 0.03356
I3 I2 E6 0.03863
I4 I1 E3 I3 0


It's in a file on disk called 'tree.data.txt'. The next step is to load it back into memory and organize it into two dictionaries, one for external and one for internal nodes. We also make a "reverseTree" dictionary where subnodes point to their parents. At the end of the initial phase, we can print the contents to the screen to verify things are OK.


E5 {'dist': '0.01491', 'name': 'Escherichia_coli'}
E4 {'dist': '0.01297', 'name': 'Salmonella_typhi'}
E6 {'dist': '0.06113', 'name': 'Haemophilus_parainfluenzae'}
E1 {'dist': '0.07574', 'name': 'Stenotrophomonas_maltophilia'}
E3 {'dist': '0.05950', 'name': 'Pseudomonas_aeruginosa'}
E2 {'dist': '0.08026', 'name': 'Kingella_oralis'}
-----
I1 {'dist': '0.00827', 'subnodes': ['E1', 'E2']}
I3 {'dist': '0.03863', 'subnodes': ['I2', 'E6']}
I2 {'dist': '0.03356', 'subnodes': ['E4', 'E5']}
I4 {'dist': '0', 'subnodes': ['I1', 'E3', 'I3']}
-----
{'I1': 'I4', 'I3': 'I4', 'I2': 'I3', 'I4': 'root', 'E5': 'I2', 'E4': 'I2', 'E6': 'I3', 'E1': 'I1', 'E3': 'I4', 'E2': 'I1'}



Now, we need to do some work. Each node's x-position will be the sum of all the x-distances back to the root. And the y-position is an integer (1-6) for the external nodes.

For the internal nodes it's a little trickier. The y-position is the average of the y-positions of its subnodes (or the median, if there are three of them). We also need to remember the values for the extreme y-positions of the subnodes (for the vertical bars in the plot). We save these in a separate dictionary called zD.

When we're done, we print out the values for each node to check them over.

We have the node name, parent, x-position and y-position, label (e.g. Escherichia coli), and finally for internal nodes, the min and max of the subnode y positions. Looks good to me.

I1 I4 0.00827 1.5 I1 (1, 2)
I3 I4 0.03863 5.25 I3 (4.5, 6)
I2 I3 0.07219 4.5 I2 (4, 5)
I4 root 0.0 3.0 I4 (1.5, 5.25)
E5 I2 0.0871 5.0 Escherichia_coli
E4 I2 0.08516 4.0 Salmonella_typhi
E6 I3 0.09976 6.0 Haemophilus_parainfluenzae
E1 I1 0.08401 1.0 Stenotrophomonas_maltophilia
E3 I4 0.0595 3.0 Pseudomonas_aeruginosa
E2 I1 0.08853 2.0 Kingella_oralis


Before I started working on this problem earlier in the week, I remembered that I had posted about it before. But I couldn't find my code from the last time, (and I forgot that I'd posted that as well---another senior moment!) Anyway, this solution is much shorter and quite a bit simpler. The complex part is what we did today, working out the x and y positions of each node, as well as (next time) mapping the lines that connecting them, and finally coercing R into doing the plot.

The entire listing:

import sys

def readData(fn):
FH = open(fn)
data = FH.read()
FH.close()
return data.strip()

def loadData():
fn = 'tree.data.txt'
data = readData(fn)
t = data.split('\n\n')
eNodeD = dict()
iNodeD = dict()

L = t[0].split('\n')
for line in L[1:]:
eNode,name,dist = line.strip().split()
eNodeD[eNode] = { 'name':name,'dist':dist }

L = t[2].split('\n')
for line in L[1:]:
t = line.strip().split()
name = t.pop(0)
dist = t.pop()
iNodeD[name] = { 'dist':dist,'subnodes':t }
return eNodeD,iNodeD
#----------------------------------------------
def getReverseTree(eNodeD,iNodeD):
def super(s):
for k in iNodeD:
if s in iNodeD[k]['subnodes']:
return k
return 'root'
rT = dict()
for n in eNodeD: rT[n] = super(n)
for n in iNodeD: rT[n] = super(n)
return rT

def init(v=False):
def report():
for k in eNodeD:
print k, eNodeD[k]
print '-----'
for k in iNodeD:
print k, iNodeD[k]
print '-----'
print rT
eNodeD,iNodeD = loadData()
rT = getReverseTree(eNodeD,iNodeD)
if v: report()
return eNodeD,iNodeD,rT
#------------------------------------
def doPositions(eNodeD,iNodeD,rT):
# x pos are distance from root
xD = dict()
for k in eNodeD.keys() + iNodeD.keys():
if k in eNodeD:
x = float(eNodeD[k]['dist'])
else:
x = float(iNodeD[k]['dist'])
super = rT[k]
while super is not 'root':
x += float(iNodeD[super]['dist'])
super = rT[super]
xD[k] = x

# y pos for eNodes are numbered sequentially
yD = dict()
for i,k in enumerate(sorted(eNodeD.keys())):
yD[k] = i+1

# for internal nodes:
# y = average of y of direct descendants
# or the median
# must also keep extreme y's for lines
zD = dict()

for k in sorted(iNodeD.keys()):
L = iNodeD[k]['subnodes']
L = [yD[n] for n in L]
if len(L) == 3:
yD[k] = L[1]
else:
yD[k] = sum(L)*1.0/len(L)
zD[k] = min(L),max(L)
return xD,yD,zD
#----------------------------------------------
# see what we got
def second_report(xD,yD,zD,eNodeD,rT):
for k in xD:
print k, rT[k],
print round(xD[k],5),
print round(yD[k],2),
if k in eNodeD:
print eNodeD[k]['name']
else:
print k,
print zD[k]
#----------------------------------------------
if __name__ == '__main__':
eNodeD,iNodeD,rT = init(v=True)
xD,yD,zD = doPositions(eNodeD,iNodeD,rT)
second_report(xD,yD,zD,eNodeD,rT)