Python for Bioinformatics: Eigenvalues, part 3

Thursday, July 9, 2009

Eigenvalues, part 3

In this post, I use R to explore a simple PCA analysis of a set of observations in 2-dimensions. We start by grabbing some numbers in the form of an elongated shape:

x = rnorm(1000,0,1)
y = rnorm(1000,0,4)
M = cbind(x,y)
X = c(-12,12)  # plot limits

plot(M,
  xlim=X,ylim=X,main='M',
  xlab='x',ylab='y')

To make things more interesting, we rotate the matrix of values by 45 degrees:

s = 1/sqrt(2)
T = matrix(c(s,s,-s,s),nrow=2)
T

          [,1]       [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068  0.7071068

M2 = M %*% T
plot(M2,
  xlim=X,ylim=X,main='M2',
  xlab='x',ylab='y')

Using the eigen and cov functions in R we find:

E = eigen(cov(M2))
E

$values
[1] 16.5603206  0.9731755

$vectors
          [,1]       [,2]
[1,] 0.7067687 -0.7074447
[2,] 0.7074447  0.7067687

If we use the E$vectors to rotate the matrix we analyzed (M2), we get:

M3 = M2 %*% E$vectors
plot(M3,
  xlim=X,ylim=X,main='M3',
  xlab='x',ylab='y')

So, I hope you can see what PCA does. It takes the ellipsoid shape of M2's points, and rotates them so that the maximum variance is along the x axis.

The first eigenvector of the covariance matrix constructed from M2 identifies the direction in which the variance is the maximum, and the first eigenvalue (E$values[1]) is this variance.

Finally, we use R's implementation of principal component analysis:

princomp(M2)

Call:
princomp(x = M2)

Standard deviations:
  Comp.1   Comp.2 
4.145496 1.005096 

 2  variables and  1000 observations.

Notice that the standard deviations match what we started with.