Python for Bioinformatics: Colors for heat maps

Sunday, August 3, 2008

Colors for heat maps

If you are a scientist, you know about heat maps. A heat map is a method for representing a number from a distribution as a color. Typically the numbers are in matrices and the colors come from a distribution like the rainbow, or red <=> black <=> green. Heat maps are a great way of visualizing trends in data, and work particularly well with cluster analysis.

However, the choice of red and green for contrasting colors in classic work like Eisen et al. is unfortunate in view of the high incidence of defects in color vision. If I read the Wikipedia article correctly, the total frequency of phenotypes with negative impacts on color perception approaches 10% in males, although it would depend on the population. Frequency is much higher in males due to X-linkage of genes for some proteins involved in color vision, including receptors.

Another issue is the nature of the data. For microarray data, we want to know who is down, who is up and who is unchanged. Other data may be frequencies, ranging from 0 to 100%. For this case, so called "topological colors" are a good choice. They consist of shades of blue from light to dark, then green to yellow, and finally yellow into straw. I am not sure who invented this distribution.

R has standard functions for plotting heatmaps and generating colors (e.g. rainbow and topo.colors), which one calls with integer argument for the number of colors required from the distribution. I got interested in generating heat maps from Python using topological colors, so I wanted to know more about the distribution. I generated a vector of colors and examined their hex RGB values. The trends for most of the distribution are clear, but the last part was not. So I decided to plot the values. Shown here are the red, green and blue components of each topo.colors value with n = 1000, compared to the actual color generated in R.

R is great at what it does easily, but I hate programming in it. I would rather use Perl, and that's saying a lot. Actually, it feels like a crazed hybrid between Perl and Lisp. I vowed not to spend another hour trying to figure out how to do something in R I can do in about 10 seconds in Python. Arghh... So the code I used to generate the plot is naturally, in Python:

import subprocess,os

# part 1:  write the R code, and call R 
L = [
'n = 1000',
'f = "topo.colors.txt"',
'v = topo.colors(n)',
'write.table(v,f)']
FH = open('.temp.Rcode.txt','w')
FH.write('\n'.join(L))
FH.close()

cmd = 'R CMD BATCH ' + os.getcwd()
cmd += '/' + '.temp.Rcode.txt'
p = subprocess.Popen(cmd, shell=True)
sts = os.waitpid(p.pid, 0)

# part 2:  basic processing of the R output

FH = open('topo.colors.txt','r')
# first line is the name of the vector
data = FH.read().strip().split('\n')[1:]
print len(data)
data = [e.replace('"','') for e in data]
# lose the '#' and the transparency
data = [e.split()[1][1:7] for e in data]
# print data[:1]
FH.close()

# part 3
# make dicts to convert hex values to 0..255

L = range(256)
L1 = [str(n) for n in L]

# add leading '0' to first 10 entries
modL = ['0' + e for e in L1[:10]]
L1 = modL + L1[10:]

L2 = [str(hex(i)[2:]) for i in L]
# add leading '0' to first 16 hex entries
modL = ['0' + e for e in L2[:16]]
L2 = modL + L2[16:]
L2 = [e.upper() for e in L2]

D = dict(zip(L,L2))
hD = dict(zip(L2,L))

# part 4:  separate red, green, blue
# write data to disk

L = [ [hD[e[:2]], hD[e[2:4]], 
           hD[e[4:6]]] for e in data]
L = [ [str(t[0]), str(t[1]), str(t[2])] for t in L]
L = [(',').join(e) for e in L]
FH = open('topo.colors.mod.txt','w')
FH.write('\n'.join(L))
FH.close()

# part 5
'''
open R and paste:
setwd('Desktop')
v = read.csv('topo.colors.mod.txt',head=F)
v[1:10,]
x=1
plot(x,type="n",xlim=c(0,1000),ylim=c(0,256))
points(v[,1],col='red',cex=0.5,pch=16)
points(v[,2],col='green',cex=0.5,pch=16)
points(v[,3],col='blue',cex=0.5,pch=16)
'''