Python for Bioinformatics: Unifrac (7): understanding PCoA---setup

Thursday, February 11, 2010

Unifrac (7): understanding PCoA---setup

I'm exploring Unifrac (first and second posts).

One of the nice aspects of UniFrac is the "multivariate analysis" it can do including PCoA (Principal Coordinates Analysis). I'm trying to understand how this works, and I have two posts to go through it. This first post is similar to stuff we've done before (e.g. here), but I want to try to present a coherent picture. Sorry for the repetition. The code is presented in two parts overall, and then today's section has two parts also. We begin by getting a bunch of integers in x-coordinates over the range from -20 to +20, and in y-coordinates randomly between -5 and +5. We turn those into a 2D array and plot them using matplotlib as red circles.

A
[[-20 -16 -12  -8  -4   0   4   8  12  16]
 [ -2   1   2   2  -1   1  -1  -3  -2   5]]

Next, the array is rotated through 45 degrees and replotted (blue circles).

In the second part (of the first half), we use standard Principal Component Analysis (PCA) on these data points to recover their long dimension (now rotated) back to the x-axis. We calculate the covariance matrix for these two sets of points, and then do np.linalg.eigh(m) to get the eigenvalues and eigenvectors.

cov
[[ 74.08888889  70.35555556]
 [ 70.35555556  78.53333333]]
eigenvalues
[   5.9204692   146.70175303]
eigenvectors
[[-0.71818168  0.69585564]
 [ 0.69585564  0.71818168]]

The eigenvalues come out with the largest one last. We grab the first eigenvector (the last in the array we received), and do

np.dot(ev1,B)

which gives back the x-values we started with.

[-20. -16. -12.  -8.  -4.   0.   4.   8.  12.  16.]

We can do the same with the second eigenvector to get the y-values. We can combine these two operations by re-ordering the eigenvectors:

print 'C'
T2 = np.array([ev1,ev2])
C = np.dot(T2,B)
for row in C:
    print np.round(row)
print

C
[-20. -16. -12.  -8.  -4.   0.   4.   8.  12.  16.]
[-2.  1.  2.  2. -1.  1. -1. -3. -2.  5.]

In the figure, the C array has been plotted as black cross-bars, which match the original red circles pretty well. Next time, we'll take the same data points and look at them from the perspective of PCoA. We'll also see how PyCogent fares with our data, and how reasonable a picture it can estimate of what we started with.

Here's the code:

import numpy as np
from pylab import *
from cogent.cluster import metric_scaling
plotting = len(sys.argv) > 1

np.random.seed(1357)
x = range(-20,20,4)
y = np.random.random_integers(-5,5,size=len(x))
A = np.array([x,y])
N = len(x)
A.shape=(2,N)
print 'A\n', A

n = 1.0/np.sqrt(2)
T = np.array([[n,-n],[n,n]])
B = np.dot(T,A)

if plotting:
    scatter(A[0],A[1],s=250,color='r')
    scatter(B[0],B[1],s=250,color='b')
        
#============================================

m = np.cov(B)
print 'cov\n', m
e = np.linalg.eigh(m)
print 'eigenvalues\n', e[0]
print 'eigenvectors\n', e[1]

ev1 = e[1][1]
print np.round(np.dot(ev1,B))
ev2 = e[1][0]
print np.round(np.dot(ev2,B))
print

print 'C'
T2 = np.array([ev1,ev2])
C = np.dot(T2,B)
for row in C:
    print np.round(row)
print

if plotting:
    scatter(C[0],C[1],s=100,c='k',marker='+')

R = (-25,25)
xlim(R)
ylim(R)
savefig('plot.png')