## Wednesday, February 10, 2010

### Unifrac (6): understanding how PyCogent does PCoA

I'm exploring Unifrac (first and second posts).

Continuing with the example from last time, I am trying to understand how the PCoA analysis works in UniFrac. I have made some progress, with a lot of help from Justin Kuczynski. The theoretical basis for what we're going to do is given in a book called "Principles of Multivariate Analysis" by A. Krzanowski (Amazon, Google Books, p 107), (among other sources).

I don't understand it (yet).

But the basic approach is that we're going to do something similar to what is done in Principal Component Analysis (PCA, see here, for example), however, the mathematical treatment is different. In PCA we have data, values for a number of measurements (x,y,z, up to n dimensions) for each of a number of samples or entities. PCA will shows projections of the data onto two-dimensions (planes), where the planes are chosen so as to maximize the spread of the overall variance in that plane. We're reducing multi-dimensional data to fewer dimensions, while preserving as much of the structure as we can.

And in PCA, again, we make a covariation matrix, and then get the eigenvalues and eigenvectors for that matrix.

In the UniFrac flavor of Principal Coordinates Analysis (PCoA) we have a distance measurement on a number of samples (based of course, on the phylogenetic distance as measured by the UniFrac metric). What Krzanowski shows is how to do something similar to PCA but starting with the distance data, or indeed, any "measure of association."

Without worrying too much further about the individual steps, here they are. First, we recreate the distance matrix from our example:

 `import numpy as npfrom cogent.cluster import metric_scalingfrom numpy.linalg import eighAB = 0.69276243AC = 0.6549957BC = 0.60625465M = np.array( [0,AB,AC,AB,0,BC,AC,BC,0] )M.shape = (3,3)print 'M\n', M, '\n'`

 `M[[ 0. 0.69276243 0.6549957 ] [ 0.69276243 0. 0.60625465] [ 0.6549957 0.60625465 0. ]]`

Now, we have two simple arithmetic operations on the data:

 `# step 1E = metric_scaling.make_E_matrix(M)print 'E\n', Eprint 'M**2/-2\n', M**2/-2, '\n'`

 `E[[-0. -0.23995989 -0.21450968] [-0.23995989 -0. -0.18377235] [-0.21450968 -0.18377235 -0. ]]M**2/-2[[-0. -0.23995989 -0.21450968] [-0.23995989 -0. -0.18377235] [-0.21450968 -0.18377235 -0. ]] `

 `# step 2F = metric_scaling.make_F_matrix(E)print 'F\n', Frow_means = np.mean(E,axis=0)col_means = np.mean(E,axis=1)overall_mean = np.mean(E)F2 = E - row_means - col_means + overall_meanprint 'F2\n', F2, '\n\n'`

 `F[[ 0.16114818 -0.08905749 -0.07209069] [-0.08905749 0.14065662 -0.05159913] [-0.07209069 -0.05159913 0.12368982]]F2[[ 0.16114818 -0.08905749 -0.07209069] [-0.08905749 0.14065662 -0.05159913] [-0.07209069 -0.05159913 0.12368982]] `

 `# step 3eigvals, eigvecs = eigh(F)print 'eigvals\n', eigvalsprint 'eigvecs\n', eigvecsprinteigval_root = np.sqrt(abs(eigvals))print 'eigval_root\n', eigval_rootprintt_eigvecs = eigvecs.transpose()print 't_eigvecs\n', t_eigvecsprintprint t_eigvecs * eigval_root[:,np.newaxis]`

As you can see, the eigenvalues do not come out in sorted order. I have copied the code that UniFrac uses to massage the vectors from this point, but I'm not sure I understand it yet---and not sure there isn't a simpler way :).

 `eigvals[ 5.20417043e-18 1.80259579e-01 2.45235038e-01]eigvecs[[-0.57735027 0.18984739 0.79411878] [-0.57735027 0.59280334 -0.56147205] [-0.57735027 -0.78265073 -0.23264672]]eigval_root[ 2.28126509e-09 4.24569875e-01 4.95212115e-01]t_eigvecs[[-0.57735027 -0.57735027 -0.57735027] [ 0.18984739 0.59280334 -0.78265073] [ 0.79411878 -0.56147205 -0.23264672]][[ -1.31708902e-09 -1.31708902e-09 -1.31708902e-09] [ 8.06034836e-02 2.51686440e-01 -3.32289924e-01] [ 3.93257240e-01 -2.78047763e-01 -1.15209477e-01]]`

The final result bears some resemblance to what `fast_unifrac` gave us (see last time).

 `================================================================ Type Label vec_num-0 vec_num-1 vec_num-2----------------------------------------------------------------Eigenvectors A 0.39 0.08 -0.00Eigenvectors B -0.28 0.25 -0.00Eigenvectors C -0.12 -0.33 -0.00 Eigenvalues eigenvalues 0.25 0.18 0.00 Eigenvalues var explained (%) 57.64 42.36 0.00----------------------------------------------------------------`

Can you see that in the final output from the script, we have the x-values for each sample in order in the third row of the vector, and the y-values in the second row? The 3rd dimension is in the first row.

While there are still plenty of mysteries, I believe we have at least a handle on what is going on here. That's enough for now.