Python for Bioinformatics: Unifrac (8): understanding PCoA---calculate

Thursday, February 11, 2010

Unifrac (8): understanding PCoA---calculate

I'm exploring Unifrac (first and second posts).

In the previous post, I showed code and some output related to our topic for this post. You'll need to generate those numbers to follow along. Of course, you could just paste 'em in:

[[-20 -16 -12  -8  -4   0   4   8  12  16]
 [ -2   1   2   2  -1   1  -1  -3  -2   5]]

The whole trick to PCoA is that it can take a "measure of association" between two samples, like a distance. So, let's use the values from before, calculate the distances between them, and then see how well PCoA recovers the original data. The second half of the code listing is at the bottom of the post. In the first segment of this second half, we simply define two functions to calculate the distances between all the points, using either the "manhattan" distance or the "euclidean" distance. We will only use the euclidean version for this analysis. If you want to see the results of the distance calculation, you'll have to run the code (manhattan) or modify it to print the euclidean result.

Next, we want to do the PCoA analysis. I did this two ways (to check myself). First, I looked under the hood at the PyCogent module cogent.cluster.metric_scaling, and I call the same functions used in metric_scaling.principal_coordinates_analysis. These include the following two functions as well as a numpy.linalg routine:

E = metric_scaling.make_E_matrix(eucl)
F = metric_scaling.make_F_matrix(E)
eigvals, eigvecs = eigh(F)

followed by a little massaging of the vectors to generate the output. As you will see if you run the code, the eigenvalues and then the last two rows of my result are:

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.3 36.34
...
[ 1.9 -1.  -2.  -1.9  1.2 -0.8  1.3  3.4  2.4 -4.5]
[-18.  -14.  -10.   -6.   -2.    2.    6.    9.9  14.   18.1]

The other approach is to call PyCogent's function directly. The eigenvalues and the last two results of the call to

pca = metric_scaling.principal_coordinates_analysis
point_matrix, eigvals = pca(eucl)

are

-0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 53.28 1320.32
...
[ 1.9 -1.  -2.  -1.9  1.2 -0.8  1.3  3.4  2.4 -4.5]
[-18.  -14.  -10.   -6.   -2.    2.    6.    9.9  14.   18.1]

This looks pretty good, though I notice that my eigenvalues aren't matching. I'm missing a square root somewhere. I will postpone worrying about this for another day.

>>> 1320.32/53.28
24.780780780780781
>>> (36.34/7.3)**2
24.781302308125362

The original matrix is:

[[-20 -16 -12  -8  -4   0   4   8  12  16]
 [ -2   1   2   2  -1   1  -1  -3  -2   5]]

I think this is pretty remarkable. We've taken a matrix of distances between our sample points, done a little PCoA mumbo-jumbo, and regenerated something pretty close to the original.

However, I am a bit worried at what look to be bugs in the implementation:

• the original x-values are systematically transformed by + 2
• the original y-values are systematically transformed by * -1

This gives a shift and a reflection of the points, but the relationships revealed by the PCoA are identical to the original.

[UPDATE: <whacks self on head> Duh...I bet that in view of the last line above, if we calculate the distance we'll get the same result as we did for the original matrix. So, of course, there's no reason that PCoA could possibly tell the difference. Presumably, it's just centered its result a little better around (0,0). ]

#============================================

def delta(n1,n2): return abs(n2-n1)

def manhattan_dist(x1,y1,x2,y2):
    return delta(x1,x2) + delta(y1,y2)
    
def euclidean_dist(x1,y1,x2,y2):
    sumsq = delta(x1,x2)**2 + delta(y1,y2)**2
    return np.sqrt(sumsq)
    
eucl = list()
manh = list()
for i in range(len(x)):
    for j in range(len(x)):
        x1,y1 = A[0][i], A[1][i]
        x2,y2 = A[0][j], A[1][j]
        manh.append(manhattan_dist(x1,y1,x2,y2))
        eucl.append(euclidean_dist(x1,y1,x2,y2))

manh = np.array(manh)
manh.shape = (N,N)
eucl = np.array(eucl)
eucl.shape = (N,N)
print 'manhattan\n', manh, '\n'
#============================================
# PyCogent stuff starts here:

# I call their functions
E = metric_scaling.make_E_matrix(eucl)
F = metric_scaling.make_F_matrix(E)
eigvals, eigvecs = eigh(F) 
t_eigvecs = eigvecs.transpose()

print 'eigvals'
for e in eigvals:  print round(e,2),
print
print 't_eigvecs'
for row in t_eigvecs:
    print np.round(row,decimals=1)

eigval_root = np.sqrt(abs(eigvals))
print 'eigval_root\n'
for e in eigval_root:  print round(e,2),
print

final = t_eigvecs * eigval_root[:,np.newaxis]
print 'my calculation'
for row in final:
    print np.round(row,decimals=1)
print

# They call their functions
print 'PyCogent'
pca = metric_scaling.principal_coordinates_analysis
point_matrix, eigvals = pca(eucl)
print 'eigvals'
for e in eigvals:  print round(e,2),
print
print 'point_matrix'
for row in point_matrix:
    print np.round(row,decimals=1)
print
print 'original matrix (A)\n', A