I'm continuing with the QIIME overview tutorial (project link, tutorial link, first post here).
These scripts were all installed in a directory which is on my $PATH, they have the executable permission set, and they all have the shebang thing so they can be run using the script name only. Extensive help is available by running with -h only. Many more options are usually available. There is also extensive online help. An index page for all the scripts is here.
I renamed the example data files, which were originally
data.*, where * is 'fna' or 'qual' just to make them shorter, except the map file (mapping barcodes and primers to samples) is
The zeroth step is to organize the sequences by sample, using the bar code. This identifier is the first few nt (here n=12), followed immediately by the primer (both are removed).
Step 0: split_libraries.py
According to the log file in
split/we have 1333 sequences split among 9 samples, after filtering out 6 sequences for one of: >1 primer mismatch, > 0 ambiguous bases, quality score < 25. The sequences are in
Step 1: pick_otus.py
According to the log file in
otus/we have 417 otus (at 97% identity). These are in
seqs_otus.txtidentified by the names assigned in step 0.
Step 2: pick_rep_set.pyPick representative sequences for each otu, to be used for taxonomy assignment and sequence alignment. There's no point in assigning or aligning a large number of essentially identical sequences (and would be impossible with a realistic data set).
Step 3: align_seqs.pyAlign representative otus using MUSCLE (for example) or PyNAST. PyNAST requires a template alignment, which for the time being I will just put into the qiime_tutorial directory, in
gg/core.txt(gg stands for Greengenes).
The output alignment is in
aln/reps_aligned.txt. Here is the strangely sparse Greengenes type of alignment. You could use MUSCLE, but it takes a while.. Although for this relatively small example, it was just a few minutes.
Step 4: assign_taxonomy.pyAssign taxonomy using the RDP classifier.
Step 5: filter_alignment.pyThe Greengenes method results in many columns having all gaps. These should be removed using the lanemask file we downloaded from them. (I don't really understand all this yet---it's my next project).
Step 6: make_phylogeny.pyJust make a phylogenetic tree. Here, we use FastTree.
Step 7: make_otu_table.pyDoes what it says
[UPDATE: -o default is actually to print to the screen]
Of course, you can run the whole "workflow" at once with (docs link):
But first, you need to make a few changes to the file
custom_parameters.txtas described in the tutorial. It provides options for all of the scripts in the sequence.
I also decided to download FigTree and try it out. The graphic at the top of the post is our tree output from step 6.
There is so much more to explore here:
we've just started with Qiime
we need to understand how PyNAST works
and how the RDP classifier works
And it would be nice to color the tree by Phylum or sample, or .. even better---plot a heat map! That's for next time.