Python for Bioinformatics: Branch and Bound

Wednesday, March 24, 2010

Branch and Bound

As we explored in the last post, one of the big difficulties in phylogenetic analysis is that the number of possible trees is enormous. The various ways of approaching the problem of exploring a huge space of possibilities, too big to enumerate, is a large part of phylogenetic analysis. Felsenstein discusses one particular approach in Chapter 5---branch and bound.

This is a general technique for solving problems like the Traveling Salesman Problem (TSP), or in our case, evaluation of the maximum parsimony tree.

A simple version of TSP is the following. We pick N points in 2D space. We want to find a path that visits each point once, and covers the shortest distance. If there are N points, then there are N! possible paths.

We decide to generate all the permutations of the N points, and we have evaluated a few of these putative solutions. We keep a record of the path with the shortest distance seen so far.

Now suppose we're evaluating a new path. We're adding up the distances, and somewhere in the middle we discover that the distance for this path is already larger than the minimum. Since the current path cannot be a solution, we should bail immediately.

To explore this I'm going to run simulations using Python. The code below is the first of two parts. Here, we generate 6 x,y points and print their values. The first row is x and the second row is y:

0.142  0.748  0.38   0.98   0.547  0.498 
0.834  0.283  0.242  0.916  0.848  0.432

Since we're going to be using the distances between points repeatedly, we cache their values in a dictionary. Here are its keys and values:

(0, 1) 0.82
(0, 2) 0.64
(0, 3) 0.84
(0, 4) 0.4
(0, 5) 0.54
(1, 2) 0.37
(1, 3) 0.67
(1, 4) 0.6
(1, 5) 0.29
(2, 3) 0.9
(2, 4) 0.63
(2, 5) 0.22
(3, 4) 0.44
(3, 5) 0.68
(4, 5) 0.42

There is a simple function to try one random choice for the path and compute its total distance. We run that 5 times with these results:

2 4 5 1 0 3  
2.998
0 1 2 5 4 3  
2.268
5 2 4 1 0 3  
3.112
5 1 3 4 0 2  
2.446
0 5 2 3 4 1  
2.7

And we plot the points in order to visualize them better, as shown at the top. Next time we'll explore this further.

There is also a function I'm not using yet, that looks at a non-uniform distribution of points. Here's the code:

import time, math, sys
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations
np.random.seed(1357)

def euclid_distance(A,c1,c2):
    x1,y1 = A[:,c1]
    x2,y2 = A[:,c2]
    return np.sqrt((x1-x2)**2 + (y1-y2)**2)
    
def heavy_tails(N):
    L = list()
    for i in range(N):
        f = np.random.random()
        while 0.2 < f < 0.8:
            f = np.random.random()
        L.append(f)
    return L

def init(N=10):
    D = dict()
    if True:
        A = np.random.uniform(size=N*2)
    else:
        A = [np.random.random() for n in range(N)]
        L = heavy_tails(N)
        A = np.array(A + L)
    A.shape = (2,N)
    for L in A:
        L = [str(round(e,3)).ljust(5) for e in L]
        for e in L:  print e + ' ',
        print
    print 
    for c2 in range(A.shape[1]):
        for c1 in range(c2):
            D[(c1,c2)] = euclid_distance(A,c1,c2)
    return A,D
        
def show_distances(A,D):
    for k in sorted(D.keys()):
        print k, str(round(D[k],2))
    print
        
def plot(A,D):
    plt.scatter(A[0,:],A[1,:],s=350,
                facecolor='0.8')
    for c in range(A.shape[1]):  
        plt.text(A[0,c],A[1,c],s=c,
                 fontsize=18,
                 ha='center',va='center')
    ax = plt.axes()
    ax.set_xlim(-0.02,1.02)
    ax.set_ylim(-0.02,1.02)
    ax.grid(True)
    plt.savefig('example.png')
    
def repr(L,dist):
    rL = [' '.join([str(n) for n in L]) + '  ']
    rL.append(str(round(dist,3)))
    return '\n'.join(rL)
    
def try_random(N,D):
    L = range(N)
    np.random.shuffle(L)
    return repr(L,total_distance(L,D))

def total_distance(L,D,n=None):
    d = 0
    for first,next in zip(L[:-1],L[1:]):
        if first < next:
            d += D[(first,next)]
        else:
            d += D[(next,first)]
        if n and d > n:
            return None
    return d

def partOne():
    N = 6
    A,D = init(N)
    show_distances(A,D)
    plot(A,D)
    for i in range(5):  
        result = try_random(N,D)
        print result

partOne()