Python for Bioinformatics: Clustering with UPGMA

Tuesday, March 30, 2010

Clustering with UPGMA

While there is still much more to talk about with respect to models of sequence of evolution, I'm going to introduce a new topic in phylogenetics today: distance methods of building a tree. The classic example of this was "popularized" by Sokal and Sneath (according to Felsenstein, p. 161). It's called UPGMA (unweighted pair-group method with arithmetic mean). This is a clustering algorithm that uses an average linkage method, and when applied to a set of distances between objects it gives a rooted tree.

Suppose we want to compute the distance between two clusters of objects (ABC and DE). We tally the distances between each pair of items (AD, AE, BE, BE, CD, CE) and average them.

To show the method in practice, let's start with the answer---that usually makes things a lot easier. In the tree at the top, we have six species, OTUs, or just objects listed as A through F in the figure, and also the indicated tree showing the paths by which the OTUs have diverged from each other. Distance on the tree is indicated as different chunks of a unit distance. We derive the following table of distances:

     A    B    C    D    E    F
A    -
B    2    -
C    4    4    -
D    6    6    6    -
E    6    6    6    4    -
F    8    8    8    8    8    -

The method is simply to choose the pair of nodes that are closest, and average them together to construct a new node, and then compute the distance of the new node from all the others. We remove the original nodes from the table.

Here, A and B are closest, so we'll form the new node AB from them. (We remember the distance from A to B was 2). The distance from AB to the other nodes is the average of the distances for A and B. This is the same for all, so we obtain:

    AB    C    D    E    F
AB   -
C    4    -
D    6    6    -
E    6    6    4    -
F    8    8    8    8    -

We can build the tree corresponding to the first two lines of what we just accomplished (AB,C):

For the next step, we'll break a tie by deciding that D and E are closest, so we collapse them into a single node (remembering that the distance from D to E was 4).

    AB    C   DE    F
AB   -
C    4    -
DE   6    6    -
F    8    8    8    -

Now, we need to combine AB and C. For this case, we need to weight the distances computed for the new node using the arithmetic mean of the distance to each component (ABC to DE). However, in this example, it doesn't matter, since they are all the same.

     ABC  DE   F
ABC   -
DE    6    -
F     8    8   -

Finally, we group ABC with DE.

     ABCDE  F
ABCDE  -
F      8    -

The method assumes a molecular clock (that the true distances between the OTUs are proportional to the observed distances between them), which is always the case for a simple clustering application, and may be a reasonable assumption for closely related species.

Felsenstein has a nice (somewhat more complicated) example based on immunological distances (Sarich 1969 PMID 5373919). The reference is from him---as for many older papers, our crappy library does not provide access, so I have not read it. An important reason to publish in "free" journals.

In the code below, the plot of hclust isn't quite working as I'd like. So, instead (looking ahead) I'll show the neighbor-joining tree from this data.

That results in an unrooted tree.

Here is the R code:

m = c(0,2,4,6,6,8,
      2,0,4,6,6,8,
      4,4,0,6,6,8,
      6,6,6,0,4,8,
      6,6,6,4,0,8,
      8,8,8,8,8,0)
dim(m)=c(6,6)
colnames(m)=c(
  'A','B','C','D','E','F')
rownames(m)=colnames(m)
d = as.dist(m)

plot(hclust(d))

library(ape)
tr=nj(d)
colors=c('blue','blue','darkgreen',
  'red','red','maroon')
plot(tr,cex=2,
  tip.color=colors)
axisPhylo()