Python for Bioinformatics: Matplotlib in OS X (5)

Wednesday, January 13, 2010

Matplotlib in OS X (5)

Still fooling around with matplotlib. This example is similar to the first one (here) but goes further, and computes the stuff we'd need for principal component analysis.

I'm still floundering a bit: it seems to be impossible or at least I haven't found out how to plot open circles. Pretty amazing, since that's the default in R. 'markerfacecolor' and 'mfc' as mentioned in the docs for pyplot don't work with ax.scatter. Also, all attempts to adjust the axes are ignored. And most important, there are surely PCA routines to do this directly.

In the first part we do something very similar to the first example, except that the matrix is constructed from a list of lists, properly, so that we don't need a call to reset the shape to a 2-row by x-col one. Also, in anticipation of part 2, I've changed the example so that the y-axis is the one with most of the variance. These points are plotted as salmon diamonds. Then, the first matrix of points is rotated by 45 degrees as before, and these points are plotted as cyan squares.

In the second part of the code, we use np.cov to compute the covariance of the rotated matrix, and then call np.linalg.eig on it, to compute the eigenvalues and eigenvectors. The eigenvalues are not guaranteed to be in any sorted order (who knows why?). To prove that the eigenvector is correct, we rotate the matrix again using that and plot those points as well. As expected, they are now spread along the x-axis.

The output (formatted by the function rs) is:

     35.27    3073.52
     36.36    3179.63       0.99

These are the variances of the x and y-points on line 1, followed by the two eigenvalues and the fraction of the total variance explained by the second (large) eigenvalue. As expected, this is about 99%.

import random, math, sys
import matplotlib.pyplot as plt
import numpy as np
random.seed(1357)

def rs(n,r=2): return str(round(n,2)).rjust(10)

r = range(-10,10)
R = range(-100,100)
xL1 = [random.choice(r) for i in range(30)]
yL1 = [random.choice(R) for i in range(30)]
print rs(np.var(xL1)), rs(np.var(yL1))
m1 = np.array([xL1,yL1])

z = 1.0/math.sqrt(2)
t = np.array([[z,-z],[z,z]])
xL2,yL2 = np.dot(t,m1)
m2 = np.array([xL2,yL2])
#---------------------------------------------
cm = np.cov(m2)
evals,evecs = np.linalg.eig(cm)
# The eigenvalues are not necessarily ordered
evals.sort()
print rs(evals[0]), rs(evals[1]),
print rs(evals[-1]/sum(evals))
xL3,yL3 = np.dot(evecs,m2)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xL1,yL1,s=150,color='salmon',marker='d')
ax.scatter(xL2,yL2,s=150,color='cyan',marker='s')
ax.scatter(xL3,yL3,s=150,color='b',marker='o')
plt.axes = ((-150,150,-150,150))
plt.savefig('example.png')