Wednesday, February 10, 2010

Unifrac (6): understanding how PyCogent does PCoA

I'm exploring Unifrac (first and second posts).

Continuing with the example from last time, I am trying to understand how the PCoA analysis works in UniFrac. I have made some progress, with a lot of help from Justin Kuczynski. The theoretical basis for what we're going to do is given in a book called "Principles of Multivariate Analysis" by W. J. Krzanowski (Amazon, Google Books, p 107), among other sources.

I don't understand it (yet).

But the basic approach is that we're going to do something similar to what is done in Principal Component Analysis (PCA, see here, for example); however, the mathematical treatment is different. In PCA we have data: values for a number of measurements (x, y, z, up to n dimensions) for each of a number of samples or entities. PCA shows projections of the data onto two dimensions (planes), where the planes are chosen so as to capture as much of the overall variance as possible. We're reducing multi-dimensional data to fewer dimensions, while preserving as much of the structure as we can.

And in PCA, again, we construct a covariance matrix, and then get the eigenvalues and eigenvectors of that matrix.
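As a reminder of that PCA recipe, here is a minimal NumPy sketch (Python 3, with made-up random data; my own illustration, not part of PyCogent or UniFrac): center the data, form the covariance matrix, eigendecompose it, and project onto the leading eigenvectors.

```python
import numpy as np

# Hypothetical data: 10 samples, each with 3 measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Center each column, then form the covariance matrix.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse so the
# direction of greatest variance comes first, PCA-style.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the first two principal components.
scores = Xc @ eigvecs[:, :2]
print(scores.shape)   # (10, 2)
```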

In the UniFrac flavor of Principal Coordinates Analysis (PCoA) we have a distance measurement on a number of samples (based, of course, on the phylogenetic distance as measured by the UniFrac metric). What Krzanowski shows is how to do something similar to PCA but starting with the distance data, or indeed, any "measure of association."

Without worrying too much further about the individual steps, here they are. First, we recreate the distance matrix from our example:

import numpy as np
from cogent.cluster import metric_scaling
from numpy.linalg import eigh

AB = 0.69276243
AC = 0.6549957
BC = 0.60625465
M = np.array( [0,AB,AC,AB,0,BC,AC,BC,0] )
M.shape = (3,3)
print 'M\n', M, '\n'

[[ 0. 0.69276243 0.6549957 ]
[ 0.69276243 0. 0.60625465]
[ 0.6549957 0.60625465 0. ]]

Now, we have two simple arithmetic operations on the data:

# step 1
E = metric_scaling.make_E_matrix(M)
print 'E\n', E
print 'M**2/-2\n', M**2/-2, '\n'

[[-0. -0.23995989 -0.21450968]
[-0.23995989 -0. -0.18377235]
[-0.21450968 -0.18377235 -0. ]]
[[-0. -0.23995989 -0.21450968]
[-0.23995989 -0. -0.18377235]
[-0.21450968 -0.18377235 -0. ]]

# step 2
F = metric_scaling.make_F_matrix(E)
print 'F\n', F

row_means = np.mean(E,axis=0)
col_means = np.mean(E,axis=1)
overall_mean = np.mean(E)
# subtracting a flat array broadcasts along the last axis both times;
# since E is symmetric, row_means == col_means and this still gives
# the right answer
F2 = E - row_means - col_means + overall_mean
print 'F2\n', F2, '\n\n'

[[ 0.16114818 -0.08905749 -0.07209069]
[-0.08905749 0.14065662 -0.05159913]
[-0.07209069 -0.05159913 0.12368982]]
[[ 0.16114818 -0.08905749 -0.07209069]
[-0.08905749 0.14065662 -0.05159913]
[-0.07209069 -0.05159913 0.12368982]]
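Taken together, steps 1 and 2 are what's known as Gower's double-centering: F = -1/2 J D² J, where J = I - (1/n)11ᵀ is the centering matrix. Here's a quick NumPy check (Python 3; my own verification, not PyCogent code) that this one-liner reproduces the F matrix printed above:

```python
import numpy as np

AB, AC, BC = 0.69276243, 0.6549957, 0.60625465
D = np.array([[0.0, AB, AC],
              [AB, 0.0, BC],
              [AC, BC, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
F = -0.5 * J @ (D ** 2) @ J           # Gower's double-centering

print(np.round(F, 8))
```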

# step 3
eigvals, eigvecs = eigh(F)
print 'eigvals\n', eigvals
print 'eigvecs\n', eigvecs

eigval_root = np.sqrt(abs(eigvals))
print 'eigval_root\n', eigval_root

t_eigvecs = eigvecs.transpose()
print 't_eigvecs\n', t_eigvecs
print t_eigvecs * eigval_root[:,np.newaxis]

As you can see, the eigenvalues do not come out in sorted order. I have copied the code that UniFrac uses to massage the vectors from this point, but I'm not sure I understand it yet, and not sure there isn't a simpler way :)

eigvals
[ 5.20417043e-18 1.80259579e-01 2.45235038e-01]
eigvecs
[[-0.57735027 0.18984739 0.79411878]
[-0.57735027 0.59280334 -0.56147205]
[-0.57735027 -0.78265073 -0.23264672]]
eigval_root
[ 2.28126509e-09 4.24569875e-01 4.95212115e-01]
t_eigvecs
[[-0.57735027 -0.57735027 -0.57735027]
[ 0.18984739 0.59280334 -0.78265073]
[ 0.79411878 -0.56147205 -0.23264672]]

[[ -1.31708902e-09 -1.31708902e-09 -1.31708902e-09]
[ 8.06034836e-02 2.51686440e-01 -3.32289924e-01]
[ 3.93257240e-01 -2.78047763e-01 -1.15209477e-01]]
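Since I suspected there might be a simpler way, here is a self-contained NumPy sketch (Python 3; my own reconstruction, not the UniFrac code itself) that gets the coordinates directly: double-center, eigendecompose, sort the eigenpairs largest-first, and scale each eigenvector by the square root of its eigenvalue.

```python
import numpy as np

AB, AC, BC = 0.69276243, 0.6549957, 0.60625465
D = np.array([[0.0, AB, AC],
              [AB, 0.0, BC],
              [AC, BC, 0.0]])

# Gower double-centering, then eigendecomposition.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
F = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(F)

# Sort eigenpairs by eigenvalue, largest first, then scale each
# eigenvector by the square root of its eigenvalue; abs() guards
# against tiny negative eigenvalues from floating-point roundoff.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
coords = eigvecs[:, order] * np.sqrt(np.abs(eigvals))

print(np.round(coords, 4))
```

Each row of coords is one sample (A, B, C) and each column is one principal coordinate; up to an arbitrary sign flip of each eigenvector, the first two columns match the numbers in the last matrix above.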

The final result bears some resemblance to what fast_unifrac gave us (see last time).

Type Label vec_num-0 vec_num-1 vec_num-2
Eigenvectors A 0.39 0.08 -0.00
Eigenvectors B -0.28 0.25 -0.00
Eigenvectors C -0.12 -0.33 -0.00
Eigenvalues eigenvalues 0.25 0.18 0.00
Eigenvalues var explained (%) 57.64 42.36 0.00
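The "var explained (%)" row, by the way, is just each eigenvalue divided by the sum of all the eigenvalues. A quick check (Python 3) using the eigenvalues from the output above:

```python
# Eigenvalues as reported by fast_unifrac, largest first.
eigvals = [0.245235038, 0.180259579, 0.0]
total = sum(eigvals)

# Each eigenvalue's share of the total, as a percentage.
for ev in eigvals:
    print(round(100 * ev / total, 2))   # prints 57.64, 42.36, 0.0
```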

Can you see that in the final output from the script, the third row holds the x-values for each sample in order, and the second row the y-values? The third dimension is in the first row.

While there are still plenty of mysteries, I believe we have at least a handle on what is going on here. That's enough for now.
