Python for Bioinformatics: Qiime (2)

I'm continuing with the QIIME overview tutorial (project link, tutorial link, first post here).

These scripts were all installed in a directory which is on my $PATH, they have the executable permission set, and they all have the shebang thing so they can be run using the script name only. Extensive help is available by running with -h only. Many more options are usually available. There is also extensive online help. An index page for all the scripts is here.

I renamed the example data files, which were originally Fasting_Example.*, to data.*, where * is 'fna' or 'qual' just to make them shorter, except the map file (mapping barcodes and primers to samples) is map.txt.

The zeroth step is to organize the sequences by sample, using the bar code. This identifier is the first few nt (here n=12), followed immediately by the primer (both are removed).

Step 0: split_libraries.py

required options:
-f sequence filename
-m map filename
-q quality scores filename
-o output sub-directory (will be created)

some default options:
-l MAX_SEQ_LEN [default:200]
-s MIN_QUAL_SCORE [default:25]
-M MAX_PRIMER_MM [default:0]

command:
split_libraries.py -f data.fna -m map.txt -q data.qual -o split

According to the log file in split/ we have 1333 sequences split among 9 samples, after filtering out 6 sequences for one of: >1 primer mismatch, > 0 ambiguous bases, quality score < 25. The sequences are in seqs.fna.

Step 1: pick_otus.py

required options:
-i input filename

some default options:
-m uclust
-c CLUSTERING_ALGORITHM [default:furthest] # for mothur
-s similarity [default:0.97]

command:
pick_otus.py -i split/seqs.fna -m uclust -s 0.97 -o otus

According to the log file in otus/ we have 417 otus (at 97% identity). These are in seqs_otus.txt identified by the names assigned in step 0.

Step 2: pick_rep_set.py

Pick representative sequences for each otu, to be used for taxonomy assignment and sequence alignment. There's no point in assigning or aligning a large number of essentially identical sequences (and would be impossible with a realistic data set).

required options:
-i input filename (otu mapping file from #1)
-f sequence filename
-m most_abundant

some default options:
-s SORT_BY [default: otu]

command:
pick_rep_set.py -f split/seqs.fna -i otus/seqs_otus.txt -m most_abundant -o otus/reps.txt

Step 3: align_seqs.py

Align representative otus using MUSCLE (for example) or PyNAST. PyNAST requires a template alignment, which for the time being I will just put into the qiime_tutorial directory, in gg/core.txt (gg stands for Greengenes).

required options:
-i input filename (representative FASTA seqs from #2)

some default options:
-m ALIGNMENT_METHOD [default: pynast]
-a PAIRWISE_ALIGNMENT_METHOD [default: uclust]
-d BLAST_DB [set using environment variable]
-t TEMPLATE_FP

command:
align_seqs.py -i otus/reps.txt -m pynast -t gg/core.txt -o aln

The output alignment is in aln/reps_aligned.txt. Here is the strangely sparse Greengenes type of alignment. You could use MUSCLE, but it takes a while.. Although for this relatively small example, it was just a few minutes.

command:
align_seqs.py -i otus/reps.txt -m muscle -o aln_muscle

Step 4: assign_taxonomy.py

Assign taxonomy using the RDP classifier.

required options:
-i input filename (representative FASTA seqs from #2)

some default options:
-m ASSIGNMENT_METHOD [default: rdp]
-c CONFIDENCE [default: 0.8]

command:
assign_taxonomy.py -i otus/reps.txt -m rdp -o tax

Step 5: filter_alignment.py

The Greengenes method results in many columns having all gaps. These should be removed using the lanemask file we downloaded from them. (I don't really understand all this yet---it's my next project).

required options:
-i input filename (representative FASTA seqs from #2)

some default options:
-m LANE_MASK_FP [default: none]
-g ALLOWED_GAP_FRAC [default: 0.999999]
-t  THRESHOLD [default: 3.0]

command:
filter_alignment.py -i aln/reps_aligned.txt -m gg/mask.txt -o aln2

Step 6: make_phylogeny.py

Just make a phylogenetic tree. Here, we use FastTree.

required options:
-i input filename (from filtered alignment, step 5)

some default options:
-t  fasttree
-r ROOT_METHOD [default: tree_method_default]

command:
make_phylogeny.py -i aln2/reps_aligned_pfiltered.fasta -o tree.tre

Step 7: make_otu_table.py

Does what it says

required options:
-i input filename (from picking otus, step 1)

some default options:
-t  TAXONOMY_FNAME [default: none] (step 4)
-e EXCLUDE_OTUS_FP (e.g. chimeras)
-o OUTPUT_FP [default: otu_table.txt]

command:
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt

[UPDATE: -o default is actually to print to the screen]

Of course, you can run the whole "workflow" at once with (docs link):

pick_otus_through_otu_table.py

But first, you need to make a few changes to the file custom_parameters.txt as described in the tutorial. It provides options for all of the scripts in the sequence.

I also decided to download FigTree and try it out. The graphic at the top of the post is our tree output from step 6.

There is so much more to explore here:

we've just started with Qiime
we need to understand how PyNAST works
and how the RDP classifier works

And it would be nice to color the tree by Phylum or sample, or .. even better---plot a heat map! That's for next time.

Sunday, February 20, 2011

Qiime (2)