Thursday, April 1, 2010

Visualizing UPGMA clustering


This post is a continuing exploration of UPGMA from a post the other day, using Python. To make it easy to visualize, the data are points in two dimensions, and the distances are euclidean.

The example code contains a Point class with a weight attribute, which is the number of points "contained" within it. The primary data points have weight 1, the combination of two primary points has weight 2, and so on.

The cluster function finds the two points that are the closest together, and averages them together, using the weights.

The plot is shown above with the primary data points in blue. I added the arrows afterward to indicate which points have been averaged. You can clearly see the effect of weighting on points 'H' and 'I'.

One thing I'm puzzling about is the conflict between the weighting that I'm doing (as described by Felsenstein) and the U in the name UPGMA. I'll have to look into this to check if I'm really doing it correctly.

[UPDATE: Professor Felsenstein has an extensive comment to this post and notes that the method used here is not UPGMA but another one. The problem is that I calculate the weighted average of (x,y) for a set of points and then compute the distance to the resulting virtual point. Although this is similar to the average of the distances when the targets lie roughly in the same direction, one can easily imagine a situation in which the results are wildly different (as he details).

I should modify this code to keep a list of its component points and do a real UPGMA calculation. I'll try to remember to do that. Thanks for reading!]


import string
import numpy as np
import matplotlib.pyplot as plt
UC = list(string.uppercase)

class Point:
def __init__(self,x,y,s=None,w=None):
self.x = float(x)
self.y = float(y)
if not s:
self.name = UC.pop(0)
else:
self.name = s
# weight = number of points included
if not w:
self.w = 1
else:
self.w = w
def dist(self,b):
ssq = (self.x-b.x)**2 + (self.y-b.y)**2
return np.sqrt(ssq)
def __repr__(self):
pos = ','.join(['x = '+str(self.x),
' y = '+str(self.y),
' w = '+str(self.w)])
return self.name + ' : ' + pos

def init():
L = list()
X = [8,18,30,30,3]
print 'X', X, sum(X)/len(X)
Y = [5,18,10,15,10]
print 'Y', Y, sum(Y)/len(Y)
for x,y in zip(X,Y):
p = Point(x,y)
L.append(p)
return L

def plot(L,col='b'):
X = [p.x for p in L]
Y = [p.y for p in L]
plt.scatter(X,Y,s=250,color=col)

def show(L):
for p in L: print p
print

def closest(L):
dist,p1,p2 = 1e6,-1,-1
for i in range(len(L)):
for j in range(i):
d = L[i].dist(L[j])
if d < dist:
dist,p1,p2 = d,i,j
return p1,p2

def cluster(L):
i,j = closest(L)
if i > j: i,j = j,i
p1,p2 = L[i],L[j]
print 'cluster'
print p1
print p2
print '-'*20
rL = L[:i] + L[i+1:j] + L[j+1:]
x = (p1.x*p1.w + p2.x*p2.w)/(p1.w+p2.w)
y = (p1.y*p1.w + p2.y*p2.w)/(p1.w+p2.w)
w = p1.w + p2.w
p = Point(x,y,w=w)
rL.append(p)
return rL,p

pL = init()
L = pL[:]
show(L)
plot(pL)
c = 0
colors = ['red','red','magenta','maroon']

while len(L) > 1:
L,p = cluster(L)
plot([p],colors[c])
c += 1
show(L)

ax = plt.axes()
ax.set_xlim(0,32)
ax.set_ylim(0,22)
plt.savefig('example.png')

1 comment:

Joe Felsenstein said...

Just noticed your UPGMA post. I agree that the U in UPGMA, which stands for Unweighted, seems hard to reconcile with the use of weighted averages. I think Sokal and Sneath were thinking of it from the point of view of the species (or whatever the things being clustered are). The weights on groups give equal weight to each species within them.

I also wonder whether your diagram really shows what UPGMA does. When a pair of objects are clustered, the distance to the cluster is the average of the distances to its members. But your diagram shows the distance being to the average of the points in the cluster. Distance to averages aren't averages of distances. Consider two points, one a mile north of me, one a mile south of me. The distance to their average is zero (as the average is right on top of me), but the average of distances is 1 mile.

I think the clustering method your diagram describes is called Centroid Linkage, and is not the same as UPGMA.