Python for Bioinformatics: Checking our test data in NJ using Phylip

Wednesday, November 24, 2010

Checking our test data in NJ using Phylip

I'd like to compare the results of my own neighbor-joining script to standard methods such as Phylip, ape, and PyCogent. In initial comparisons, ~~I noticed some small differences among those methods themselves with the data from before~~ (UPDATE: I just failed to synch the data properly). (here). I also noticed that in Phylip's input screen for neighbor, they give the option of :

Outgroup root?  No, use as outgroup species  1
Randomize input order of species?  No. Use input order

I thought it would be interesting to see if there are any differences observed in the tree if we use a different input order of species. We could use Python to rewrite the data file to change to a specific order, but as a simpler test we'll just let Phylip randomize it for us. After getting the kinks out, we do 50 reps, calling neighbor as a process (just like here). One wrinkle: the process returns faster than the file write from Phylip, so we have to wait a second before trying to do anything with the files ourselves.

The data for Phylip look like this (note requirement for 10 characters on each line preceeding the distances, and a count for the otus on line 1):

     6
A         0   5   4   7   6   8
B         5   0   7  10   9  11
C         4   7   0   7   6   8
D         7  10   7   0   5   8
E         6   9   6   5   0   8
F         8  11   8   8   8   0

As a bonus, I asked PyCogent to compare the topology of all the resulting trees, but no differences were found. And then, we plot the first one with ape.

import os, subprocess, time, sys, random
from cogent import LoadTree

prefix = '/Users/telliott_admin/Desktop/'
controlfn = prefix + 'responses.txt'
datafn = prefix + 'nj_data.phylip.txt'
resultfn = prefix + 'outfile'
treefn = prefix + 'outtree'

def one_run(n):
    n = str(n)
    FH = open(controlfn, 'w')
    FH.write(datafn + '\n')
    FH.write('J\n')
    seed = random.choice(range(2,200))
    if not seed % 2: seed += 1
    FH.write(str(seed) + '\n')
    FH.write('Y\n')
    FH.close()
    cmd = 'neighbor < ' + controlfn + ' > '
    cmd += 'screenout &'
    p = subprocess.Popen(cmd, shell=True)
    pid,ecode = os.waitpid(p.pid, 0)
    print ecode
    time.sleep(1)
    os.rename(resultfn,prefix+'results/outfile' + n + '.txt')
    os.rename(treefn,prefix+'results/outtree' + n + '.txt')
    os.remove(prefix + 'screenout')

N = 50
for i in range(1,N+1):  one_run(i)

L = list()
for i in range(1,N+1):
    tr = LoadTree(prefix + 'results/outtree' + str(i) + '.txt')
    L.append(tr)
    
for i in range(N):
    for j in range(N):
        if i == j:
            continue
        if not L[i].sameTopology(L[j]):
            print L[i]
            print L[j]
        else:
            print '.',
        
'''
R code:
library(ape)
setwd('Desktop')
tr = read.tree('results/outtree1.txt')
plot(tr,edge.width=3,cex=2,type='unrooted')
axisPhylo()
'''