Saturday, March 12, 2011

Dental project (6)

I want to show some more results from this project, namely, the UniFrac analysis. What I did for the paper was to cluster very closely related sequences (alignment > 450 and 0 or 1 mismatches), then upload them to RDP, which aligns the sequences as they are uploaded. The phylogenetic tree needs to be rooted, and I decided to use Thermotoga SL7 for this (Genbank AJ401017).

Rather than deal with the clustered OTUs for this post, I just uploaded all 1120 sequences, and carried out the analysis. The first time through (today) I forgot to include the outgroup! So that gives us a chance to see how much difference it makes.

  • Working in directory: temp
  • Check that seqs.fna from dental project dir has 1120 seqs
  • Rename to dental_1120.fna
  • Upload to RDP
  • Download as dental_1120_rdp.fna

  • Use R/ape to make a tree


    setwd('Desktop/temp/rdp')
    library(ape)
    dna = read.dna('dental_1120_rdp.fna',format='fasta')
    tr = nj(dist.dna(dna))
    plot(tr)
    write.tree(tr,'tree.txt')


  • Write a simple script to make the environment file


    python write_env.py > environ.txt


    It looks like this:


    DAA_44  D_DAA
    DAA_43 D_DAA
    DQ_209 D_DQ
    ..


  • Upload tree and environment file to UniFrac
  • PCA (unweighted)
  • View as data table



  • Download data to pca_web.txt
  • Run the script below to plot the data using matplotlib.

    Here it is:




    From the UniFrac FAQ:


    My tree was not rooted, but I was able to upload my file and perform an analysis. Are the results valid?

    There is no way to tell based on a Newick string alone whether a tree is rooted or not. If an unrooted tree is input, UniFrac will usually assign an arbitrary root and allow you to perform the analysis on that tree. How the tree is rooted can affect the results of both UniFrac tests and the P test. You should redo the analysis with a tree that is rooted with an appropriate outgroup.


    It turns out to be easy enough..go back to RDP and browse to find SL7 and then add it to the sequence cart. Repeat the download to dental_1120+_rdp.fna. Load the last 5 sequences into clustalx.app and look to make sure that SL7 is really properly aligned.




    setwd('Desktop/temp/rdp')
    library(ape)
    dna = read.dna('dental_1120+_rdp.fna',format='fasta')
    tr = nj(dist.dna(dna))

    > tr

    Phylogenetic tree with 1121 tips and 1119 internal nodes.

    Tip labels:
    DAA_44, DAA_43, DQ_209, DAA_45, DAA_40, DC_81, ...

    Unrooted; includes branch lengths.

    > grep('SL', tr$tip.label)
    > 1121


  • Root the tree appropriately and write it to disk


    tr2=root(tr,1121)
    plot(tr2)
    write.tree(tr2,'rooted_tree.txt')


  • Go back to UniFrac



    Repeat the PCA. You can look at the data in a spreadsheet app:



    Now I plot it in matplotlib. The first image is what I plotted today for the rooted tree. The second is from the paper. Looks pretty good to me. Also, note some minor differences from the previous graphic where the tree we used was unrooted (and UniFrac rooted it for us however it does when it's not properly rooted).





    plot_web.py


    import sys
    import matplotlib.pyplot as plt
    from fileUtilities import load_data

    d = 0.5

    fn = 'pca_web.txt'
    data = load_data(fn)
    data = data.strip().split('\n\n')[0]
    data = data.strip().split('\n')[1:]
    L = list()

    for e in data:
    name, x, y = e.split()[:3]
    x,y = float(x), float(y)
    x *= -1
    name = name[2:]
    if name[1] in 'BCM': c = 'blue'
    else: c = 'red'
    plt.scatter(x,y,s=100,color=c)
    if name == 'DG':
    y += 0.03
    plt.text(x+0.03,y-0.02,va='center',
    s=name[1:],color=c,fontsize=16)
    plt.plot((-d,d),(0,0),':',zorder=0)
    plt.plot((0,0),(-d,d),':',zorder=0)

    ax = plt.axes()
    ax.set_xlim(-d,d)
    ax.set_ylim(-d,d)
    plt.savefig('pca_web.png')