Python for Bioinformatics: Clustering with hclust from Python

Sunday, November 21, 2010

Clustering with hclust from Python

As long as I have RPy working, I did a first version of a clustering script. First we do our imports and grab the definitions of R objects:

import sys, random
import matplotlib.pyplot as plt
import numpy as np
from rpy2 import robjects
import hm_utils as hm
random.seed(153)

matrix = robjects.r['matrix']
hclust = robjects.r['hclust']
dist = robjects.r['dist']

Next we generate a numpy array with random values in the set {0.5 .. 1.0}. Then we pick a few rows and mark a few of the same columns in each by changing each column's value to a random one between 0.0 .. 0.2. These show up as the light rectangles in the plot. Then we just:

# convert to R 'matrix' and cluster
A = list(hm.flatten(A))
M = matrix(A,nrow=R)
d = dist(M, method='euclidean')
result = hclust(d,method='average')
o = result[2]     # order

The only hard part is going back and forth between a Python array and R matrix. There's probably a better way to do it, but what I used was flatten from here before feeding to R, and this going the other way:

def to_array(M,R,C):
    A = np.array(list(M))
    A.shape = R,C
    return A

We plot a heat map using code from here. As you can see in the graphic at the top of the post, hclust does a reasonable job of finding the structure we created. Rather than list the code, I put a zipped folder with the files on Dropbox here.