Python for Bioinformatics: Dental project (4)

Thursday, March 10, 2011

Dental project (4)

This post is one of a series (see dental project here or in the sidebar).

After getting a set of sequences and removing chimeras, the next step is almost anticlimactic. We just copy a modified version of the shell script (from here, without the cd calls) or paste in the commands working from the dental directory (either individually, or all at once):

#!/bin/bash

pick_otus.py -i seqs.fna -m uclust -s 0.97 -o otus
pick_rep_set.py -f seqs.fna -i otus/seqs_otus.txt -m most_abundant -o otus/reps.txt
align_seqs.py -i otus/reps.txt -m pynast -t ~/data/core.txt -o aln
assign_taxonomy.py -i otus/reps.txt -m rdp -o tax
filter_alignment.py -i aln/reps_aligned.txt -m ~/data/mask.txt -o aln2
make_phylogeny.py -i aln2/reps_aligned_pfiltered.fasta -o figs/tree.tre
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt
summarize_taxa.py -i figs/otu_table.txt -o figs/otu_table_Level3.txt -L 3
plot_taxa_summary.py -i figs/otu_table_Level3.txt -l Phylum -o figs -k white
make_otu_heatmap_html.py -i figs/otu_table.txt -o figs

It's all over in a few seconds.

The heatmap QIIME produced is at the top of the post. It is truly a remarkable html page, with a graphic where you can reorder the columns or rows by drag and drop, and redo the map at different threshholds for the OTUs, etc. I've never seen anything quite like it. But (and this is just me), it's not pretty enough.

So what I'd like to do from here is to show you how I currently make heatmaps with matplotlib, and we'll get into that next time.

First, I have to extract the data from QIIME. The script is complicated a bit by an additional job: I'm going to organize the rows and columns. (QIIME can do this too---see the tutorial).

The columns will be in the order they appear in sample_names.txt and the rows as they appear in genera_and_colors.txt. These files are in the same directory. The second one starts like this:

# Bacteria black
Bacteria
# Bacteroidetes green
Bacteroidetes
Bacteroidales
Rikenella
Prevotella
Porphyromonas
Saprospiraceae
Sphingobacteriales

I just do this:

> python grab_qiime_data.py > data.csv

Here's the first part of data.csv:

,DB,DC,DM,DF,DL,DG,DI,DA,DT,DQ,DV,DAA,DZ
Bacteria,,,,2,,,,,,,,,12
Bacteroidetes,,,,,,2,,,,,,,
Bacteroidales,,,,,,1,,,,,,,
Prevotella,,,,,3,2,,1,,,,,
Porphyromonas,,,,,,1,,1,,,,,
Sphingobacteriales,,,,,,,,,,1,,,
Capnocytophaga,9,,6,2,3,6,1,19,31,35,2,8,7
Fusobacterium,1,,,4,,2,1,2,,2,1,3,1

The leading comma on line 1 is so the column headers line up properly in a spreadsheet. Speaking of spreadsheets, here is a screenshot after dropping the data.csv file onto Numbers (you could use Excel, of course):

That was painless! Zipped project files in Dropbox (here).