I'm exploring UniFrac (first and second posts).
One of the nice aspects of UniFrac is the "multivariate analysis" it can do, including PCoA (Principal Coordinates Analysis). I'm trying to understand how this works, and I have two posts to go through it. This first post is similar to stuff we've done before (e.g. here), but I want to try to present a coherent picture. Sorry for the repetition.
The code is presented in two parts overall, and then today's section has two parts also. We begin by getting a bunch of integers for the x-coordinates over the range from -20 to +20, and for the y-coordinates randomly between -5 and +5. We turn those into a 2D array and plot them using matplotlib as red circles. Next, the array is rotated through 45 degrees and replotted (blue circles).
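As a rough sketch of this first step (not the original listing), the points and the rotation might be set up as below. The names A, B and R, the fixed seed, and the choice of one point per integer x are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for repeatability

# x: one point per integer from -20 to +20 (assumed layout)
# y: random integers between -5 and +5 inclusive
x = np.arange(-20, 21)
y = rng.integers(-5, 6, size=x.size)
A = np.array([x, y], dtype=float)    # shape (2, N): row 0 is x, row 1 is y

# standard 2D rotation matrix, counter-clockwise by 45 degrees
theta = np.radians(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = R @ A                            # rotated points, same shape as A
```

With matplotlib.pyplot imported as plt, plt.scatter(A[0], A[1], color='red') and plt.scatter(B[0], B[1], color='blue') would draw the two sets of circles.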
In the second part (of the first half), we use standard Principal Component Analysis (PCA) on these data points to bring their long dimension (now rotated) back to the x-axis. We calculate the covariance matrix m of the rotated x- and y-coordinates, and then call np.linalg.eigh(m) to get the eigenvalues and eigenvectors. The eigenvalues come out in ascending order, with the largest one last, and the eigenvectors are the columns of the returned array. We grab the first eigenvector (the last column in the array we received) and take its dot product with the rotated points, which gives back the x-values we started with (possibly with the sign flipped, since an eigenvector's direction is arbitrary).
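A minimal sketch of that projection, assuming the same hypothetical A and B arrays as the setup described above (rebuilt here so the snippet runs on its own). Only m and np.linalg.eigh come from the post; the other names are mine.

```python
import numpy as np

# rebuild the example data (assumed layout: row 0 is x, row 1 is y)
rng = np.random.default_rng(0)
x = np.arange(-20, 21)
y = rng.integers(-5, 6, size=x.size)
A = np.array([x, y], dtype=float)
theta = np.radians(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = R @ A

m = np.cov(B)                      # 2x2 covariance; each row of B is a variable
evals, evecs = np.linalg.eigh(m)   # eigenvalues ascending, eigenvectors in columns
v1 = evecs[:, -1]                  # eigenvector for the largest eigenvalue
pc1 = v1 @ B                       # project every rotated point onto v1
```

Because the random y-values are slightly correlated with x in any finite sample, pc1 tracks the original x-values closely rather than exactly, and its sign may be reversed.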
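We can do the same with the second eigenvector to get the y-values. We can combine these two operations by re-ordering the eigenvectors so that the one with the largest eigenvalue comes first and applying both to the rotated points at once; the result is the array C.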
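One way the combined step could look, again as a sketch under the same assumptions about the data layout (the name E for the re-ordered eigenvectors is hypothetical):

```python
import numpy as np

# rebuild the example data (assumed layout: row 0 is x, row 1 is y)
rng = np.random.default_rng(0)
x = np.arange(-20, 21)
y = rng.integers(-5, 6, size=x.size)
A = np.array([x, y], dtype=float)
theta = np.radians(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = R @ A

m = np.cov(B)
evals, evecs = np.linalg.eigh(m)

E = evecs[:, ::-1]    # reverse the columns: largest-eigenvalue eigenvector first
C = E.T @ B           # row 0: long axis (approx. x), row 1: short axis (approx. y)
```

The rows of C are uncorrelated by construction, so the covariance matrix of C is diagonal, with the eigenvalues in descending order on the diagonal.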
In the figure, the C array has been plotted as black cross-bars, which match the original red circles pretty well. Next time, we'll take the same data points and look at them from the perspective of PCoA. We'll also see how PyCogent fares with our data, and how reasonable a picture it can estimate of what we started with.
Here's the code: