Python for Bioinformatics: Visualizing UPGMA clustering

Thursday, April 1, 2010

Visualizing UPGMA clustering

This post is a continuing exploration of UPGMA from a post the other day, using Python. To make it easy to visualize, the data are points in two dimensions, and the distances are euclidean.

The example code contains a Point class with a weight attribute, which is the number of points "contained" within it. The primary data points have weight 1, the combination of two primary points has weight 2, and so on.

The cluster function finds the two points that are the closest together, and averages them together, using the weights.

The plot is shown above with the primary data points in blue. I added the arrows afterward to indicate which points have been averaged. You can clearly see the effect of weighting on points 'H' and 'I'.

One thing I'm puzzling about is the conflict between the weighting that I'm doing (as described by Felsenstein) and the U in the name UPGMA. I'll have to look into this to check if I'm really doing it correctly.

[UPDATE: Professor Felsenstein has an extensive comment to this post and notes that the method used here is not UPGMA but another one. The problem is that I calculate the weighted average of (x,y) for a set of points and then compute the distance to the resulting virtual point. Although this is similar to the average of the distances when the targets lie roughly in the same direction, one can easily imagine a situation in which the results are wildly different (as he details).

I should modify this code to keep a list of its component points and do a real UPGMA calculation. I'll try to remember to do that. Thanks for reading!]

import string
import numpy as np
import matplotlib.pyplot as plt
UC = list(string.uppercase)
    
class Point:
    def __init__(self,x,y,s=None,w=None):
        self.x = float(x)
        self.y = float(y)
        if not s:
            self.name = UC.pop(0)
        else:
            self.name = s
        # weight = number of points included
        if not w:
            self.w = 1
        else:
            self.w = w
    def dist(self,b):
        ssq = (self.x-b.x)**2 + (self.y-b.y)**2
        return np.sqrt(ssq)
    def __repr__(self):
        pos = ','.join(['x = '+str(self.x),
                       ' y = '+str(self.y),
                       ' w = '+str(self.w)])
        return self.name + ' : ' + pos
        
def init():
    L = list()
    X = [8,18,30,30,3]
    print 'X', X, sum(X)/len(X)
    Y = [5,18,10,15,10]
    print 'Y', Y, sum(Y)/len(Y)
    for x,y in zip(X,Y):
        p = Point(x,y)
        L.append(p)
    return L

def plot(L,col='b'):
    X = [p.x for p in L]
    Y = [p.y for p in L]
    plt.scatter(X,Y,s=250,color=col)
    
def show(L):
    for p in L:  print p
    print

def closest(L):
    dist,p1,p2 = 1e6,-1,-1
    for i in range(len(L)):
        for j in range(i):
            d = L[i].dist(L[j])
            if d < dist:
                dist,p1,p2 = d,i,j
    return p1,p2
    
def cluster(L):
    i,j = closest(L)
    if i > j:  i,j = j,i
    p1,p2 = L[i],L[j]
    print 'cluster'
    print p1
    print p2
    print '-'*20
    rL = L[:i] + L[i+1:j] + L[j+1:]
    x = (p1.x*p1.w + p2.x*p2.w)/(p1.w+p2.w)
    y = (p1.y*p1.w + p2.y*p2.w)/(p1.w+p2.w)
    w = p1.w + p2.w
    p = Point(x,y,w=w)
    rL.append(p)
    return rL,p
        
pL = init()
L = pL[:]
show(L)
plot(pL)
c = 0
colors = ['red','red','magenta','maroon']

while len(L) > 1:
    L,p = cluster(L)
    plot([p],colors[c])
    c += 1
    show(L)
    
ax = plt.axes()
ax.set_xlim(0,32)
ax.set_ylim(0,22)
plt.savefig('example.png')