Python for Bioinformatics: Unifrac analysis: introduction

Monday, February 8, 2010

Unifrac analysis: introduction

I'm going to take a few posts to explore the "Unifrac tool for comparing microbial community diversity in a phylogenetic context." UniFrac was developed by Rob Knight and Cathy Lozupone and their colleagues at the University of Colorado, and has become an important statistical approach in microbial ecology.

The methodological papers are:

• Lozupone 2005 PMID 16332807
• Lozupone 2006 PMID 16893466
• Hamady 2010 PMID 19710709

(The last one is behind a paywall and likely to remain there, unfortunately.)

Both the original version of UniFrac and a new implementation called FastUniFrac (the Hamady paper above) have web interfaces. FastUniFrac is also part of PyCogent.

The specific goal is to understand the population structure of microbial communities. In "classical" ecology, we might count how many species are present in a given environment ("lions and tigers and bears," to quote the Tin Woodsman), and the numbers of each. In microbial ecology, the data we have are DNA sequences, typically 16S rRNA gene sequences. Since these are homologous sequences, we can obtain additional information, including the phylogenetic distances between different individuals. UniFrac makes use of phylogenetic distance in quantifying community structure.

There is the complication of deciding when two sequences belong to the same "species" or OTU (operational taxonomic unit), which we will ignore for the moment.

Knight et al. introduced the UniFrac "metric." Consider two idealized phylogenetic trees. In the first one, above left, the sequences from two environments (red and blue) are cleanly separated into monophyletic clusters. We measure the "unique branch length": the fraction of total branch length that leads exclusively to one or the other type of sequence, shown in red or blue in the figure.

In contrast, the second tree (right) has the identical structure but the environment labels have been switched for one pair of sequences. Now, a substantial fraction of branch length on the tree is not unique to one or the other label (shared branches are colored black in the figure).
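To make the two cases concrete, here is a minimal sketch of the calculation on a made-up four-leaf tree ((A,B),(C,D)). The tree, the branch lengths, and the names BRANCHES and unifrac are my own assumptions for illustration; this is not the PyCogent API. Each branch is stored as its length plus the set of leaves below it, and a branch counts as "unique" when all of those leaves carry the same environment label.

```python
# Hypothetical toy tree ((A,B),(C,D)): each entry is (branch length, leaves below the branch).
# Terminal branches have length 1.0, the two internal branches have length 2.0.
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"C"}),
    (1.0, {"D"}),
    (2.0, {"A", "B"}),
    (2.0, {"C", "D"}),
]


def unifrac(branches, env_of):
    """Fraction of total branch length that leads to leaves from only one environment."""
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        total += length
        environments = {env_of[leaf] for leaf in leaves}
        if len(environments) == 1:  # every leaf below this branch has the same label
            unique += length
    return unique / total


# First tree: the two environments form clean monophyletic clusters.
separated = {"A": "red", "B": "red", "C": "blue", "D": "blue"}
# Second tree: same structure, but the labels of B and C have been switched.
mixed = {"A": "red", "B": "blue", "C": "red", "D": "blue"}

print(unifrac(BRANCHES, separated))  # 1.0
print(unifrac(BRANCHES, mixed))      # 0.5
```

In the mixed case only the four terminal branches remain unique (4 of 8 units of branch length), which is why the terminal branch lengths matter for the result.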

The statistical significance of the UniFrac metric is determined by a resampling technique, randomizing the labels on sequences (in the same proportion as in the original sample). In a large number of replicates (say, 1000), the method counts the number of times the randomized statistic is more extreme than the value for the observed tree. If this count, expressed as a fraction of the replicates, is < 0.05, we consider the result significant.
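The resampling idea can be sketched in a few lines. Everything here is an assumption for illustration: the eight-leaf toy tree, the function names, and in particular my choice to count a replicate when its value is at least as large as the observed one (ties included), which may differ from what the UniFrac software actually does. Shuffling the existing list of labels, rather than drawing new ones, keeps the number of red and blue sequences the same in every replicate.

```python
import random

# Hypothetical toy tree (((A,B),(C,D)),((E,F),(G,H))): (branch length, leaves below the branch).
BRANCHES = [(1.0, {leaf}) for leaf in "ABCDEFGH"] + [
    (1.0, {"A", "B"}),
    (1.0, {"C", "D"}),
    (1.0, {"E", "F"}),
    (1.0, {"G", "H"}),
    (2.0, {"A", "B", "C", "D"}),
    (2.0, {"E", "F", "G", "H"}),
]


def unifrac(branches, env_of):
    """Fraction of total branch length that leads to leaves from only one environment."""
    unique = sum(length for length, leaves in branches
                 if len({env_of[leaf] for leaf in leaves}) == 1)
    total = sum(length for length, _ in branches)
    return unique / total


def permutation_p_value(branches, env_of, replicates=1000, seed=0):
    """Estimate how often shuffled labels give a value >= the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    leaves = sorted(env_of)
    labels = [env_of[leaf] for leaf in leaves]
    observed = unifrac(branches, env_of)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(labels)  # permutes labels in place; label counts never change
        shuffled = dict(zip(leaves, labels))
        if unifrac(branches, shuffled) >= observed:
            hits += 1
    return hits / replicates


# Observed labeling: the two environments are cleanly separated.
observed_env = {leaf: "red" for leaf in "ABCD"}
observed_env.update({leaf: "blue" for leaf in "EFGH"})

p_value = permutation_p_value(BRANCHES, observed_env)
print(p_value)
```

For this toy tree only 2 of the 70 possible labelings with four red and four blue leaves are as cleanly separated as the observed one, so the estimate should come out close to 0.03.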

One point that I'm pretty sure of, but not certain about:
• randomized samples contain the same total of blue or red sequences as in the original.

[UPDATE: an earlier version of this post had an error. The correct statement is that:
• the lengths of the terminal branches are considered when doing the calculation.]