Sunday, February 27, 2011
Flag day
Here are the flags of the countries of origin for visitors since I put up the flag widget in the sidebar (flags from Wikipedia; apologies for any errors).
I tried to cluster them, but couldn't think of a good algorithm :) Colors and symbols, I guess.
Thanks for reading!
UPDATE: welcome to:
Qiime (9) shell scripts
This is the README.txt file of a zipped directory on Dropbox (here). It's been tested on OS X and on a Virtual Box running Linux with QIIME (my post here).
The scripts in this directory are intended to simplify the process followed in the Qiime overview tutorial:
http://qiime.sourceforge.net/tutorials/tutorial.html |
What they do:
They do require that Qiime is already installed, and that the files core.txt and mask.txt be available in ~/data.
To use this, copy this directory (qiime_shellscripts) to the Desktop. Fire up the Terminal and do:
~/Desktop/qiime_shellscripts/download.sh |
This script will download the Qiime overview tutorial and unpack it to a new directory qiime_tutorial on your Desktop. On Linux, you will need to substitute wget for curl (or install curl). As things progress, it is likely that you will have to edit the version number in the script. The zipped download remains on the Desktop, so if things get messed up, you won't have to repeat the download. Then do:
~/Desktop/qiime_shellscripts/preliminary.sh |
This will do step 0, splitting the sequences. Next do:
~/Desktop/qiime_shellscripts/run_all.sh |
If you wish to re-run the workflow analyses individually, you can first remove results of the previous work with:
~/Desktop/qiime_shellscripts/clean.sh |
This last script does not reverse the split step.
If you need a new copy of this directory, it should still be on Dropbox:
http://dl.dropbox.com/u/3534458/qiime_shellscripts.zip |
Saturday, February 26, 2011
Qiime (8) In the Virtual Box
We still have more to do in the tutorial, but this post is about the virtual machine with QIIME installed. The website discussion for this is here. My thinking is that for a real 454 or Illumina data set (which I should have at some point in the future), I will want to run "in the cloud." It's all over the web:
"If you need milk, would you buy a cow?"
To do that I will need to get comfortable with Linux and Qiime in this environment. The Qiime website has discussion about the EC2 "image" (here), but for today's example I used the Virtual Box (VB) with Qiime.
A couple of issues had to be addressed first. There is something called "Guest Additions" that needs to be installed. The way I did it was to do:
cd /media/VBOXADDITIONS_4.0.4_70112/ |
and then restart the Qiime VB. The second is to train my brain to use the command key (Apple symbol) on OS X.
And finally, I found the password (qiime---had to search the QIIME Forum for that). The Linux install doesn't have curl, so I used:
wget http://bmf.colorado.edu/QIIME/qiime_tutorial-v1.2.1.zip |
Then I just followed my own instructions from here and here. I got a failure with:
plot_taxa_summary.py: command not found |
which I fixed by:
svn co https://qiime.svn.sourceforge.net/svnroot/qiime/trunk Qiime |
Failed the following unit tests. |
I'm not too worried about the failures. I'll look into it.
There's a screenshot at the top of the post. Looking good.
[ UPDATE: Instructions for shared folders with the host OS here. ]
Thursday, February 24, 2011
Qiime (7) Area Plot
We were plotting a summary chart in QIIME (here). Turns out I just skipped a step:
summarize_taxa.py -i figs/otu_table.txt -o figs/otu_table_Level3.txt -L 3 |
Qiime (6) quick version
This example is for the impatient (like me): running part 1 of the QIIME tutorial--checklist version.
[ UPDATE Note: the "workflow script" in the tutorial is actually easier to run than what I have here. It needs only three changes: paths to the Greengenes files (below) and changing the rep set picking method to most_abundant, if desired. This post is about doing the workflow in such a way that the scripts are invoked individually. ]
Prerequisites (as discussed here and here)
Install and test:
Make sure you have the equivalent of this (some scripts don't seem to look at $PATH):
> cat ~/.qiime_config |
and that ~/bin/qiime/bin is on your $PATH:
> pick_otus.py -h | head -n |
From the Desktop, get the Greengenes core set data file (38 MB download link). Also get the Greengenes alignment lanemask file (download link). And get the tutorial files (download link), or do:
curl -O http://bmf.colorado.edu/QIIME/qiime_tutorial-v1.2.0.zip |
Make some directories:
mkdir qiime_tutorial/gg |
For the next two steps, if you get an error, recheck the file extensions:
mv core_set_aligned.fasta.imputed qiime_tutorial/gg/core.txt |
I prefer short names, so these directory and file names are different than those in the tutorial.
cd qiime_tutorial |
Step 0:
split_libraries.py -f data.fna -m map.txt -q data.qual -o split |
At this point, you could make a backup (including the huge core.txt):
cp -r ../qiime_tutorial ../qiime_tutorial_backup |
Analysis:
pick_otus.py -i seqs.fna -m uclust -s 0.97 -o otus |
Rather than do another heatmap, I thought I'd show a summary area chart
plot_taxa_summary.py -i figs/otu_table.txt -l Phylum -o figs -k white |
Unfortunately, it's not what we want (the graphic at top seems to be plotting the individual OTUs rather than phyla). The example in the tutorial uses otu_table_Level3.txt. I don't see an option for controlling the level of output in make_otu_table.py. I'll have to investigate how to do that. Perhaps if we follow the workflow script it will turn out correctly.
Tuesday, February 22, 2011
Qiime (5) alpha diversity
Continuing with the exploration of QIIME (earlier posts: one two three four).
We're following the overview tutorial (here), which has four main parts. There are the three big sections of
picking and analysis of species/OTUs
analysis of alpha diversity
analysis of beta diversity
plus
a suite of data visualization programs
Today we're thinking about alpha diversity (within sample diversity). Goals of diversity analysis (using 16S rRNA sequences in bacteria) include measurement of:
observed species/OTU richness
population distribution among species/OTUs
phylogenetic diversity among OTUs
where OTUs can be defined at different thresholds of sequence identity.
For this post, I've got screenshots of two graphics produced from the Qiime tutorial analysis of alpha diversity. Consider the first graphic, which plots observed species in two different samples as a function of the number of individual sequences examined. Because chance influences the order in which individuals are sampled, resampling techniques are used to generate a rarefaction curve that averages the results of many random subsamples of the observed data. Perhaps more important, rarefaction normalizes for sampling depth, allowing comparison of samples of different sizes.
These plots are rarefaction curves.
The most important questions are probably these two:
(1) If we could sample exhaustively, what would be the final species count (or phylogenetic distribution or.. ). In other words, does the curve level off, and what is the asymptotic value?
(2) Can we decide whether two different populations differ significantly even without exhaustive sampling?
I do not know much about this area, so I should probably just be quiet at this point, but I have to say that I am suspicious that the first question does not always have a good answer (even if people would wish for one). The problem is that the shape of the curve as it goes out into high number of samples does not necessarily depend on the shape at lower numbers. It might do so, if the population structure is not too skewed. But no one can say in advance whether that is true or not.
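To make the rarefaction idea concrete, here is a toy sketch (my own illustration, not a QIIME script): treat a sample as a list of individuals labeled by OTU, subsample it without replacement at a series of depths, and average the number of distinct OTUs seen at each depth.

import random

def rarefaction_curve(individuals, depths, trials=100):
    curve = []
    for n in depths:
        counts = []
        for _ in range(trials):
            subsample = random.sample(individuals, n)   # without replacement
            counts.append(len(set(subsample)))          # distinct OTUs observed
        curve.append(sum(counts) / float(trials))
    return curve

# a skewed toy community: OTU 0 dominates, 50 other OTUs are rare
community = [0] * 500 + [otu for otu in range(1, 51) for _ in range(2)]
depths = [10, 50, 100, 200, 400, 600]
print(rarefaction_curve(community, depths))

The averaged counts rise quickly and then flatten, which is the leveling-off behavior discussed above; whether the curve really approaches the true richness depends on how skewed the community is.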
Anyway, we follow the tutorial. First remake (and this time, save) the otu_table:
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt |
We'll use the workflow script (I don't see too much of interest after a cursory look at the intermediate steps). We can get the required options with -h (help):
> alpha_rarefaction.py -h |
What we actually run is:
command:
alpha_rarefaction.py -i figs/otu_table.txt -m map.txt -o rare/ -p custom_parameters.txt -t figs/tree.tre -f |
The default settings in custom_parameters.txt are fine for this.
If you do want to run each of the steps individually, I suggest you run the workflow script first; the log file in the output directory will contain a history of the commands that were executed:
# Alpha rarefaction command |
The second graphic shows that considering phylogenetic diversity may reveal significant differences that are not seen when just counting OTUs.
Fetching sequences from NCBI
Here's the problem I want to solve:
I need to collect sequences from NCBI for a small database of 16S rRNA gene sequences from bacteria of interest. Many of these are from clonal isolates so the FASTA-formatted sequences have title lines like
>gi|29290057|gb|AY207063.1| Abiotrophia sp. oral clone P4PA_155 P1 16S ribosomal RNA gene, partial sequence |
I can parse that title, or I also have the Genbank identifier in a database entry like this:
AY207063.1 Abiotrophia_clone_P4PA_155 |
I grab the corresponding sequence using PyCogent (as discussed in the post here):
from cogent.db.ncbi import EFetch |
> python fetch.py |
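The listing for fetch.py is truncated above. A minimal sketch of the kind of call I mean, assuming EFetch accepts the usual efetch query parameters as keyword arguments (check the PyCogent docs for the exact interface):

from cogent.db.ncbi import EFetch

# fetch one FASTA record by its Genbank identifier (keyword arguments assumed)
ef = EFetch(id='AY207063.1', rettype='fasta', retmode='text')
print(ef.read())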
However, I also want to get some rRNA sequences from genome projects. In that case the FASTA-formated RefSeq record is huge, and I just want a small part of it.
From the prokaryote genome projects page at NCBI, I click on the RefSeq link for E. coli MG1655, and on the RefSeq page I click on Structural RNAs
I find what I'm looking for halfway down the page:
16S ribosomal RNA of rrnB operon
I copy the link for "DNA region in flatfile format" (the green diamond)
http://www.ncbi.nlm.nih.gov/nuccore/NC_000913?from=4164682&to=4166223 |
I tried feeding the range information (from and to) to EFetch but got nowhere. And I'm not sure if this is the best way, but here is what I did. I browsed through
class EFetch(UrlGetter) in |
I just added a couple of keys to the dict which is used to construct the url. In cogent.db.ncbi, after line 133, I inserted:
'from', 'to', |
Now this works:
from cogent.db.ncbi import EFetch |
> python fetch.py |
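Again, the listing is truncated. With the 'from' and 'to' keys patched into EFetch as described above, a sketch of the call might look like this ('from' is a Python keyword, so the parameters are passed as a dict; the keyword handling is an assumption):

from cogent.db.ncbi import EFetch

# slice the rrnB 16S region out of the E. coli MG1655 RefSeq record
params = {'id': 'NC_000913', 'rettype': 'fasta', 'retmode': 'text',
          'from': '4164682', 'to': '4166223'}
ef = EFetch(**params)
print(ef.read())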
Genbank is smart enough to figure out that the id we sent is not a real gi but refers to a record in the RefSeq database. Rather than using from and to, you'd think it would be OK to use something like:
NC_000913:4164682-4166223 |
but I couldn't get it past EFetch, even with URL encodings. Guess I should run this by the PyCogent crew for issues. Maybe they'll want to put it in the toolkit.
Monday, February 21, 2011
One world
I would be somewhat embarrassed if you knew how often I look at the Stats for the blog. See, I'm hoping someone will find it interesting. And I've been very excited recently to discover in my audience what must be new visitors to the site, from Hungary, Poland, New Zealand, Argentina.. I certainly don't remember seeing those before.
So I googled a bit and found Flag Counter. Hope you like it! (Hope it works).
And the story with the flag is this. A long time ago, a friend of mine and I planned (using the word very loosely) that one day we would go to Tonga and set up a (small) research institute. Among other things, we were excited by the fact that the King was a surfer. This idea captured my imagination. So I'm hoping that one day, I will see his flag in the sidebar. Let me know if I miss it.
Qiime (4) PyNAST
Today I've been on a side-track from the QIIME pipeline (or toolkit), related to the construction of multiple sequence alignments (MSA). The favored approach in QIIME is that of NAST (DeSantis et al. 2006 PMID 16845035) as reimplemented in PyNAST using Python (Caporaso et al. 2010 PMID 19914921).
The fundamental problem with MSA is that it scales very poorly. Analysis of the complexity is itself complex (see the discussion in Edgar 2004 PMID 15318951), but Edgar gives it as O(N^4 + L^2) overall (N sequences of typical length L). Though of similar complexity, MUSCLE is significantly faster. Still, it grinds quite slowly when you have a lot of sequences (say, > 500).
From the first MUSCLE paper:
In agreement with other studies [e.g. Katoh et al. (8)], we found that T-Coffee was unable to align more than approximately 10^2 sequences of typical length on a current desktop computer. CLUSTALW was able to align a few hundred sequences, with a practical limit around N = 10^3 where CPU time begins to scale approximately as N^4. The largest set had 5000 sequences of average length 350. MUSCLE-p completed this test in 7 min, compared with 10 min for FFTNS1; we estimate that CLUSTALW would need approximately 1 year.
NAST (Nearest Alignment Space Termination) is one solution to this problem. The basic idea is simple: use an existing multiple sequence alignment as a framework against which to align new sequences. The trick is that addition of a new sequence will not change the length of the MSA. Besides the algorithm, which we'll talk about, a significant part of the effort going into NAST has been the construction of a large, curated database of high quality bacterial and archaeal 16S rRNA sequences.
The implementation has these steps:
• identify the best matching sequence to the candidate in the MSA database
• trim the candidate and do a global alignment to this best match
• resolve positions where the alignment introduces a gap into the template
The example from DeSantis et al is below (T = template; C = candidate). I've shown the template spaces as underlines to distinguish them from alignment gaps:
T: ATAC_____GTA_AC____GTA___C___G_T_AC_GG |
Collapse the template spacing:
T: ATACGTAACGTACGTACGG |
Construct the pairwise alignment
T: ATACGT-A-ACGTACGTAC--GG |
Re-introduce template spacing
* * |
If you look at the template in the pairwise alignment you will see four positions containing introduced gap characters. Two of these can be accommodated by gaps in the original template spacing, and two cannot (marked with *). These will have to go.
The * positions must be collapsed to maintain the constant length MSA. In the case of the first one, the aligned GT to its left is displaced. In the second one, the gap between T_A in the candidate is consumed. The final result is:
T: ATAC_____GT-AAC____GTA___C___G_T_AC_GG |
Note that in both cases, a mismatch was accepted in order to repair the gap. This is the weakness of the NAST approach.
The template may not look exactly the same as when it started, but this difference is not propagated to the database.
We obtained two files from Greengenes following the Qiime tutorial (my post here). The first is a "core_set":
>>> FH = open('core_set_aligned.fasta.imputed.txt') |
It contains nearly five thousand high quality 16S ribosomal RNA gene sequences that have been aligned. Each one of those sequences consists mostly of '-' gap characters. Almost all of those positions are gaps in every single sequence. These are indexed using the "lane mask":
>>> FH = open('lanemask_in_1s_and_0s.txt') |
Naturally, the "core_set" is a large file
>>> data.count('>')*len(mask) |
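The interpreter session above is abbreviated. A self-contained version of the same checks might look like this (it assumes the two Greengenes files, named as in the post, are in the working directory):

core = open('core_set_aligned.fasta.imputed.txt').read()
mask = open('lanemask_in_1s_and_0s.txt').read().strip()

n_seqs = core.count('>')                                   # number of FASTA records
print('sequences in core set: %d' % n_seqs)
print('alignment length (lane mask): %d' % len(mask))
print('columns kept by the mask: %d' % mask.count('1'))
print('approximate total characters: %d' % (n_seqs * len(mask)))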
I made some fake data to run through pynast:
>0 A |
Sequence B is the same as A, but missing the first nt, C is missing the first 6 nt, and D has a substitution for the first 6 nt.
The template looks like this:
>Y |
> pynast -i data.txt -t template.txt -l 25 |
There's a warning about not having code for parallel processing. And here's the alignment:
>0 A 1..41 |
One of the things that's been overlooked in the present exploration is the crucial contribution of another Robert Edgar program, uclust. That's used both for MSA and at least one other step in the QIIME pipeline. I need to read the paper and we'll look at it another day.
Sunday, February 20, 2011
Qiime (3) Heatmap
This example should really be part of the previous post (here), about the first segment of the overview tutorial for Qiime (here). One script in Qiime makes a heatmap, and it is quite something. To do this example, I modified #7 from last time to save otu_table.txt in a sub-directory rather than write to the screen. The second command makes the heat map. Here are the commands:
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt |
The first thing is that the output is html and can be displayed in the browser (great idea). The second is that it's quite complicated, with more than 5000 lines of javascript required for display. It allows drag and drop of the table rows, and custom mapping files to control the sorted order of the table rows. There are "tool tips" to display detail for each cell.
The primary author, Jesse Stombaugh, and also the library authors Erik Bosrup (overLIB) and John Resig (jQuery) obviously put a huge amount of effort into this.
I am not sure why the default output from this example (by Taxonomy) looks so weird. Anybody know why there are so many duplicate row names?
I would probably make the red used for the highest counts a brighter color.
Put this on the list for more investigation.
Qiime (2)
I'm continuing with the QIIME overview tutorial (project link, tutorial link, first post here).
These scripts were all installed in a directory which is on my $PATH, they have the executable permission set, and they all have the shebang thing so they can be run using the script name only. Extensive help is available by running with -h only. Many more options are usually available. There is also extensive online help. An index page for all the scripts is here.
I renamed the example data files, which were originally Fasting_Example.*, to data.*, where * is 'fna' or 'qual', just to make them shorter; the map file (mapping barcodes and primers to samples) is map.txt.
The zeroth step is to organize the sequences by sample, using the bar code. This identifier is the first few nt (here n=12), followed immediately by the primer (both are removed).
Step 0: split_libraries.py
required options: |
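Conceptually, this step reads each sequence, looks up the first 12 nt in the barcode-to-sample map, checks the primer, and strips both off. Purely as an illustration (the barcodes, primer, and read below are made up, and the real split_libraries.py also does quality filtering):

barcode_len = 12
primer = 'CATGCTGCC'                 # hypothetical primer, for illustration only
barcode_to_sample = {
    'AAAACCCCGGGG': 'sample1',       # hypothetical barcodes
    'TTTTGGGGCCCC': 'sample2',
}

def assign(read):
    barcode = read[:barcode_len]
    sample = barcode_to_sample.get(barcode)
    if sample is None:
        return None                  # unknown barcode: discard
    rest = read[barcode_len:]
    if not rest.startswith(primer):
        return None                  # primer mismatch: discard
    return sample, rest[len(primer):]

read = 'AAAACCCCGGGG' + primer + 'ACGTACGTACGT'
print(assign(read))                  # ('sample1', 'ACGTACGTACGT')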
According to the log file in split/ we have 1333 sequences split among 9 samples, after filtering out 6 sequences for one of: >1 primer mismatch, >0 ambiguous bases, quality score < 25. The sequences are in seqs.fna.
Step 1: pick_otus.py
required options: |
According to the log file in otus/ we have 417 otus (at 97% identity). These are in seqs_otus.txt, identified by the names assigned in step 0.
Step 2: pick_rep_set.py
Pick representative sequences for each otu, to be used for taxonomy assignment and sequence alignment. There's no point in assigning or aligning a large number of essentially identical sequences (and it would be impossible with a realistic data set).
required options: |
Step 3: align_seqs.py
Align representative otus using MUSCLE (for example) or PyNAST. PyNAST requires a template alignment, which for the time being I will just put into the qiime_tutorial directory, in gg/core.txt (gg stands for Greengenes).
required options: |
The output alignment is in aln/reps_aligned.txt. Here is the strangely sparse Greengenes style of alignment. You could use MUSCLE, but it takes a while, although for this relatively small example it was just a few minutes.
command: |
Step 4: assign_taxonomy.py
Assign taxonomy using the RDP classifier.
required options: |
Step 5: filter_alignment.py
The Greengenes method results in many columns having all gaps. These should be removed using the lanemask file we downloaded from them. (I don't really understand all this yet---it's my next project.)
required options: |
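The idea of the lane mask is simple: it is a string of 1s and 0s, one character per alignment column, and filtering keeps only the columns marked 1. A toy sketch of the operation (not the QIIME script itself):

def apply_lanemask(aligned_seq, mask):
    # keep only the columns where the lane mask has a '1'
    return ''.join(c for c, m in zip(aligned_seq, mask) if m == '1')

print(apply_lanemask('A-CG--T', '1011001'))   # 'ACGT' for this made-up example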
Step 6: make_phylogeny.py
Just make a phylogenetic tree. Here, we use FastTree.
required options: |
Step 7: make_otu_table.py
Does what it says.
required options: |
[UPDATE: -o default is actually to print to the screen]
Of course, you can run the whole "workflow" at once with (docs link):
pick_otus_through_otu_table.py |
But first, you need to make a few changes to the file custom_parameters.txt as described in the tutorial. It provides options for all of the scripts in the sequence.
I also decided to download FigTree and try it out. The graphic at the top of the post is our tree output from step 6.
There is so much more to explore here:
we've just started with Qiime
we need to understand how PyNAST works
and how the RDP classifier works
And it would be nice to color the tree by Phylum or sample, or .. even better---plot a heat map! That's for next time.
Saturday, February 19, 2011
QIIME (1)
Yesterday and this morning, I've been working on installing QIIME and doing the tutorial. It stands for: Quantitative Insights into Microbial Ecology (link). It's yet another major software project out of Rob Knight's lab.
I'd like to post on the issues/questions/answers I run into as I work through it. It'll also give me an opportunity to get back to Bioinformatics (e.g. multiple sequence alignments).
The first thing is that they recommend you use a virtual machine (Virtual Box) and install the QIIME Virtual Box on it, which is a huge download that packages Ubuntu Linux with all of the QIIME dependencies, correctly configured. Sounds like a great idea.
I started by trying this. I had to download overnight on my machine at work (> 2 GB), then buy a new thumb drive to bring the file home, since I don't have the recommended amount of memory for the VM (1 GB) on that machine. But I ran into trouble---basically much aggravation dealing with the different keys on Linux, but also the Virtual Box and extra work trying to figure out copy/paste, moving files over, etc. The killer was when the VM prompted for an administrator's password, and of course I don't have it. There shouldn't be one needed..
[ After a search of the Forum the mystery word is revealed to be: qiime. I shoulda guessed it. ]
So I decided instead to deal with the dependencies for QIIME. I got almost all of them (like 24 or so), but for the "default pipeline" you really don't need so many. Particularly if you already have PyCogent and matplotlib installed, as I do, I would recommend just going down the list. It's not difficult, and it's better to be in an environment where you're comfortable working.
There was really no trouble configuring them (at least the essentials).
#1: Python 2.6
I have it, since I'm on OS X. I ran into a little trouble because (from playing with MacPorts) I still had /opt/local/bin on my path, as changed in .bash_profile. The QIIME install checks this variable and modifies its scripts (the shebang thing), so I had to fix that later on.
#2: Numpy 1.3.0
I have this too.
>>> import numpy |
Not sure what the default OS X version is, but easy_install will fix you up quickly.
#3: PyCogent
Of course, this has its own dependencies. But we've been through that before (here, here, here).
#4: MUSCLE
This is not one of the primary dependencies, but it's an old friend. I upgraded to the latest and greatest version, but it caused an error in the QIIME tests. Luckily I still had muscle3.6_src around. As usual, I put it in ~/Software and do:
ln -s ~/Software/muscle3.6_src/muscle ~/bin/muscle |
and I put ~/bin on my $PATH in .bash_profile:
export PATH=$HOME/bin/:$PATH |
#5: MAFFT
It's an installer. Piece of cake. Check it:
> mafft |
#6 uclust
See the QIIME install notes for this. This is the work of Robert Edgar (also the creator of MUSCLE), who is a talented programmer, but also a businessman. Hence, no source code, and no 64-bit builds for some things. But there's a download link on the QIIME site. The only problem is that Safari put a .txt extension on the download; I thought there was a problem, went off on a wild goose chase, and got a version that is not new enough. Enough said. Put it in ~/Software and link as usual:
> uclust |
#7 PyNAST
(download) Unpack and move to Software:
python setup.py install |
After installing MUSCLE and MAFFT, it passes all tests except:
AssertionError: Got (DnaSequence(ACGTACG... 23), DnaSequence(ACGTACG... 23)), but expected (DnaSequence(ACGTACG... 23), DnaSequence(ACGTACG... 23)) |
That's pretty silly.
#8 Greengenes files
greengenes core set data file (fasta)
greengenes alignment lanemask file (txt)
Where to put them? I put them in qiime_tutorial, see below, but will eventually want them in Qiime somewhere.
#9 FastTree
fasttree 2.1.0 (src)
gcc -Wall -O3 -finline-functions -funroll-loops -o FastTree -lm FastTree-2.1.0.c |
Move to Software and link.
#10 Java
I don't need to add the Java runtime. I got it from Apple a while ago (here):
> java -version |
#11 RDP Classifier
rdp_classifier-2.0.1 (src). In .bash_profile:
export RDP_JAR_PATH=$HOME/Software/rdp_classifier/rdp_classifier-2.0.jar |
Check for a mismatch in the name of the classifier jar file. What I actually got was named 2.0, not 2.0.1.
#12 QIIME
That's it for the default pipeline.
svn co https://qiime.svn.sourceforge.net/svnroot/qiime/trunk |
I followed their advice and put the install scripts in a special place (but I wouldn't do that again).
python setup.py install --install-scripts=~/bin/qiime/bin/ |
The problem with the special directory is that you will need a .qiime_config file (see the docs). Make sure to put it in ~/.qiime_config.
Also not needed yet, but I'll list them here anyway:
#13 BLAST
blast-2.2.22. We've done that one before. The BLASTMAT variable must point to the NCBI data directory. In .bash_profile:
export BLASTMAT=$HOME/Software/blast-2.2.22/data |
#14 Infernal
(download)
./configure |
#15 R
Run R and do:
install.packages('randomForest') |
I upgraded to R 2.12.1 from 2.10.0, but it has some issue. Trying to get the ape package from CRAN hangs the app. I backed off to 2.11.1 (see here).
Still to come: 10 more bits of software. But they're not needed for the first part of the tutorial, so we should just do that first. Oh, I also have Sphinx installed (from the PyCogent instructions), so the documentation got built. But I've just been working from the web version anyway. On to the fun stuff.
Labels:
phylogenetics,
PyCogent,
Qiime,
simple Python,
software installs
Thursday, February 17, 2011
Fifteen
I've put up the files for the Fifteen puzzle I posted about the other day (here). This is an Xcode project (and includes a copy of the built application). It's on Dropbox (here).
You can move tiles by clicking with the mouse or using the arrow keys (much faster).
I have not solved the technical problem of eliminating unsolvable puzzles, but I've played a few times and have yet to run into one. Either my expectations are wrong or I'm just lucky. I'll try to work more on this some other time.
Wednesday, February 16, 2011
Strang's Linear Algebra: rotation & eigenvectors
The rotation matrix R is a fairly basic matrix from linear algebra. When it multiplies a vector, it rotates that vector through the angle θ. We've seen it before (here):
R = [ cos(θ)  -sin(θ) ]
    [ sin(θ)   cos(θ) ]
In Strang section 6.2 (first challenge problem), we are asked to prove the reasonable supposition:
R(θ)n = R(nθ) |
That is, n sequential applications of the matrix R should give the same result as a single rotation through an angle of nθ.
To calculate R^n we'll need the eigenvalues and eigenvectors of R. The interest of this problem comes from consideration of the following question:
what sort of vector could have the same direction after rotation by an arbitrary angle θ?
Use the standard formula:
det (A - λI) = 0
[ cos(θ) - λ     -sin(θ)    ]
[   sin(θ)     cos(θ) - λ   ]
[cos(θ) - λ]^2 + sin^2(θ) = 0
λ^2 - 2 λ cos(θ) + cos^2(θ) + sin^2(θ) = 0
λ^2 - 2 λ cos(θ) + 1 = 0
We need the quadratic equation to find λ:
λ = [2 cos(θ) +/- sqrt[4 cos^2(θ) - 4] ] / 2
  = [2 cos(θ) +/- 2 sqrt[cos^2(θ) - 1] ] / 2
  = cos(θ) +/- sqrt[cos^2(θ) - 1]
  = cos(θ) +/- sqrt[-sin^2(θ)]
  = cos(θ) +/- i sin(θ)
λ1 = cos(θ) + i sin(θ)
λ2 = cos(θ) - i sin(θ)
Strang gives the eigenvectors as
x1 = (1,i) x2 = (i,1) |
However, I think this is a misprint since R x1 gives:
[ cos(θ)  -sin(θ) ] [ 1 ]   [ cos(θ) - i sin(θ) ]
[ sin(θ)   cos(θ) ] [ i ] = [ sin(θ) + i cos(θ) ]
but
λ1 x1 = [cos(θ) + i sin(θ)] [ 1 ]   [  cos(θ) + i sin(θ) ]
                            [ i ] = [ -sin(θ) + i cos(θ) ]
We could solve this:
(A - λ1 I) x1 = 0
(A - λ2 I) x2 = 0
For x1:
[ -i sin(θ)    -sin(θ) ] [ ? ]   [ 0 ]
[   sin(θ)   -i sin(θ) ] [ ? ] = [ 0 ]
Without writing it out, it looks like we just need to try a change of sign:
x1 = (-1, i)
A x1 = λ1 x1
[ cos(θ)  -sin(θ) ] [ -1 ]   [ -cos(θ) - i sin(θ) ]
[ sin(θ)   cos(θ) ] [  i ] = [ -sin(θ) + i cos(θ) ]
and λ1 x1 gives the same result:
λ1 x1 = [cos(θ) + i sin(θ)] [ -1 ]   [ -cos(θ) - i sin(θ) ]
                            [  i ] = [ -sin(θ) + i cos(θ) ]
Similarly, x2 should be (i, -1) because R x2 equals:
[ cos(θ)  -sin(θ) ] [  i ]   [  sin(θ) + i cos(θ) ]
[ sin(θ)   cos(θ) ] [ -1 ] = [ -cos(θ) + i sin(θ) ]
and
λ2 x2 = [cos(θ) - i sin(θ)] [  i ]   [  sin(θ) + i cos(θ) ]
                            [ -1 ] = [ -cos(θ) + i sin(θ) ]
Everything checks.
Now that we have the eigenvalues and eigenvectors, we need to diagonalize R. We construct the columns of S directly from the eigenvectors:
S = [ -1   i ]
    [  i  -1 ]
Λ is just
[ λ1   0 ]
[  0  λ2 ]
To get S^-1 let's try the trick for 2 x 2's:
S^-1 = 1/2 [ -1  -i ]
           [ -i  -1 ]
and multiply to confirm we do get I:
1/2 [ -1  -i ] [ -1   i ]   [ 1  0 ]
    [ -i  -1 ] [  i  -1 ] = [ 0  1 ]
Let's also confirm that S Λ S^-1 = R. Part one is S Λ:
[ -1   i ] [ cos(θ) + i sin(θ)          0         ]   [ -cos(θ) - i sin(θ)    sin(θ) + i cos(θ) ]
[  i  -1 ] [         0          cos(θ) - i sin(θ) ] = [ -sin(θ) + i cos(θ)   -cos(θ) + i sin(θ) ]
Step 2, multiply on the right by S^-1:
[ -cos(θ) - i sin(θ)    sin(θ) + i cos(θ) ]     [ -1  -i ]   [ a  b ]
[ -sin(θ) + i cos(θ)   -cos(θ) + i sin(θ) ] 1/2 [ -i  -1 ] = [ c  d ]
a = 1/2 [ cos(θ) + i sin(θ) - i sin(θ) + cos(θ) ] = cos(θ)
b = 1/2 [ i cos(θ) - sin(θ) - sin(θ) - i cos(θ) ] = -sin(θ)
c = 1/2 [ sin(θ) - i cos(θ) + i cos(θ) + sin(θ) ] = sin(θ)
d = 1/2 [ i sin(θ) + cos(θ) + cos(θ) - i sin(θ) ] = cos(θ)
And it checks.
To do the exponentiation we just need to calculate Λ^n. So what is:
λ1^n = [ cos(θ) + i sin(θ) ]^n
λ2^n = [ cos(θ) - i sin(θ) ]^n
Strang drops another hint by reminding us of Euler's formula:
e^(iθ) = cos(θ) + i sin(θ)
(e^(iθ))^n = e^(inθ)
which is the whole point! But what about:
λ2^n = [ cos(θ) - i sin(θ) ]^n
If we compare -θ with θ then:
cos(-θ) = cos(θ)
sin(-θ) = -sin(θ)
cos(θ) - i sin(θ) = cos(-θ) + i sin(-θ) = e^(-iθ)
(e^(-iθ))^n = e^(-inθ)
Now all we have to do is convert back to sine and cosine:
λ1^n = cos(nθ) + i sin(nθ)
λ2^n = cos(nθ) - i sin(nθ)
The big computation we did above (to check that S Λ S^-1 = R) is exactly the same with the substitution of nθ for θ:
R(θ)^n = [ cos(nθ)  -sin(nθ) ]
         [ sin(nθ)   cos(nθ) ]
We're done!
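As a quick numerical sanity check (my addition, not part of Strang's problem), numpy agrees that repeated application of R(θ) matches a single rotation through nθ:

import numpy as np

def R(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta, n = 0.3, 7
lhs = np.linalg.matrix_power(R(theta), n)   # R(theta) multiplied by itself n times
rhs = R(n * theta)
print(np.allclose(lhs, rhs))                # True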
Tuesday, February 15, 2011
A permutations puzzle
I wrote a very simple game in PyObjC---it's a "fifteen" puzzle. The goal of the game is to get all 15 tiles in order with the empty space in the lower right hand corner. This can be viewed as the permutation:
1, 2, 3 .. 13, 14, 15, 0 |
I had a vague recollection that some such puzzles are unsolvable, and it's fairly easy to see by just playing that one cannot achieve this order using the standard layout:
1, 2, 3 .. 13, 15, 14, 0 |
And conversely, if you were presented with the modified layout, the puzzle would not be solvable. Since it seems really important not to give the user an unsolvable puzzle, I looked into it some more and found that permutations (which is what this problem is about) have parity.
Suppose we take the following permutation of 1..5:
3 4 5 2 1 |
How many swaps (u,v = v,u) does it take to put the numbers into ascending order? This sequence of three works:
(13)(35)(24) |
So we can think of this permutation of the sequence as having an odd parity. One question I don't know the answer to is how you would prove that there does not exist some longer sequence of swaps that also reaches the "correct" order, but has even parity. Probably, if I studied the article it is explained there.
A second measure I saw for parity is to take each member of the sequence in turn and compare it with all the members at a higher index:
for i in range(len(L)-1): |
We count +1 if the ith element is out of (ascending) order with respect to the jth element, and define the total count of such mismatched pairs as a different measure of parity.
This is the output from such an analysis on all permutations of the sequence 0, 1, 2, 3, showing the permutation, then the results of the swap test, followed by the order test:
0 1 2 3 0 0 |
The number of swaps required to achieve sorted order is never more than 3, while the reverse order has a maximum of 6 comparisons giving an incorrect order. I notice that these two measures seem to be either both odd or both even, and confirmed that result by testing all permutations of range(10) and random samples from range(25).
Looking ahead, the idea will be that the moves in the puzzle, and particularly those which really matter (vertical ones), don't change the parity, but so far I am not getting the expected answer with that.
from itertools import permutations as perms |
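The listing above is truncated; here is a minimal version of the two parity measures (a sketch of the idea, not necessarily the original script): count the swaps needed to sort, count the out-of-order pairs, and check that the two counts always have the same parity.

from itertools import permutations as perms

def n_inversions(L):
    # count pairs (i, j) with i < j that are out of ascending order
    return sum(1 for i in range(len(L))
                 for j in range(i + 1, len(L))
                 if L[i] > L[j])

def n_swaps_to_sort(L):
    # swap each value into its home position, counting the transpositions
    L = list(L)
    count = 0
    for i in range(len(L)):
        while L[i] != i:
            j = L[i]
            L[i], L[j] = L[j], L[i]
            count += 1
    return count

for p in perms(range(4)):
    s, inv = n_swaps_to_sort(p), n_inversions(p)
    print('%s swaps=%d inversions=%d same_parity=%s' % (p, s, inv, (s % 2) == (inv % 2)))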
Sunday, February 13, 2011
The slope of the sine curve
Just to finish with the subject from last time (here), let's show that the slope of the sine curve at
x is cos(x), and for the cosine curve at x it is -sin(x).
We start from x, and then move a little bit h. Using the rule for the sum of sines (here):
sin(x + h) = sin(x) cos(h) + cos(x) sin(h) |
What we want is the limit as h -> 0 of (1/h) [sin(x + h) - sin(x)]. Substituting the expansion above and grouping terms, the first term is -sin(x) times (1/h)(1 - cos(h)); last time we showed that (1/h)(1 - cos(h)) equals zero in the limit as h -> 0.
The second term is cos(x) times (1/h) sin(h); we showed that (1/h) sin(h) approaches 1 as h -> 0. Thus,
d/dx sin(x) = cos(x) |
Since sine and cosine are periodic with cosine "ahead":
sin(x + π/2) = cos(x) |
If y = cos(x) = sin(x + π/2), let u = x + π/2. Then, by the chain rule,
d/dx cos(x) = d/du sin(u) * du/dx = cos(u) = cos(x + π/2) = -sin(x)
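A quick numerical check of both results (my addition): the difference quotient should approach cos(x) for sine and -sin(x) for cosine.

from math import sin, cos

h = 1e-6
for x in (0.0, 0.5, 1.0, 2.0):
    print('%.1f  %.6f  %.6f' % (x, (sin(x + h) - sin(x)) / h, cos(x)))
    print('%.1f  %.6f  %.6f' % (x, (cos(x + h) - cos(x)) / h, -sin(x)))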
Know your limits
To find the derivative of the sine function, we need to determine two limits. The first is:
lim (h -> 0) sin(h)/h |
In the figure, h is the measure of the angle and also the length of the arc of the circle between A and D (in radians). (I'm using h instead of x because ultimately this will be a small change to x, as in x + h.)
Kline sets up this inequality:
area OAB < area OAD < area OED |
The last result follows since OAB and OED are similar triangles so that ED/OD = AB/OB, and OD = 1. Thus, the inequality becomes:
(1/2) cos(h) sin(h) < (1/2) h < (1/2) sin(h)/cos(h) |
As long as we do not actually reach h = 0, we are allowed to divide by sin(h):
cos(h) < h/sin(h) < 1/cos(h) |
As h approaches 0, cos(h) and its reciprocal 1/cos(h) both approach 1, so the ratios h/sin(h) and sin(h)/h both approach 1 as well.
Strang does the first part by extending the segment AB (dotted line) and noting that
2 AB < 2 h |
The other thing we need is:
lim (h -> 0) (1 - cos(h))/h |
Following Strang:
sin^2(h) + cos^2(h) = 1 |
Since sin(h) < h (see the figure again):
(1 + cos(h)) (1 - cos(h)) = sin^2(h) < h^2 |
Divide by h and also by 1 + cos(h):
(1 - cos(h))/h < h/(1 + cos(h)) |
Strang says, note that all terms are positive, so that:
0 < (1 - cos(h))/h < h/(1 + cos(h)) |
Now, as
h -> 0
the right side must go to 0/2
(that is, 0
) and our ratio is "caught in the middle."Here is a simple Python script to calculate the same values. Notice that although
1-cos(h)
and h
are both approaching 0, h is approaching more slowly, so the ratio tends to the 0 in the limit. First the output:h 1.0000000000, sin(h) 0.8414709848, h/sin(h) 0.8414709848 |
and the listing:
from __future__ import division |
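The listing is truncated above; a minimal script that produces this kind of table (a sketch, not necessarily the original) is:

from __future__ import division
from math import sin, cos

h = 1.0
for i in range(10):
    print('h %.10f, sin(h) %.10f, sin(h)/h %.10f, (1-cos(h))/h %.10f' % (
        h, sin(h), sin(h) / h, (1 - cos(h)) / h))
    h /= 10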
Thursday, February 10, 2011
A plot of the quadratic in Python
I have another simple matplotlib example, based on the parabola post from the other day (here). The conclusion from that post was that any function:
y(x) = ax^2 + bx + c |
can be manipulated to the form:
y - y0 = a(x - x0)^2 |
That is, the parabola is just y = ax^2 translated to have a different vertex. I didn't do a systematic examination or anything, but here is one example in Python.
The function quadratic takes arguments a, b, and c and returns the vertex (x0, y0) as well as a numpy array containing x such that |x - x0| <= 4, and a second one containing f(x) for each value of x.
We plot two sets of three parabolas, each set has one for each of a = 1, 2 and 3. One set is at the origin (cyan, blue and purple). The second set (magenta, salmon and red) has c = 3 and:
a = 1; b = 2 |
Since x0 = -b/2a is the same for each parabola in the second set, they all have the same axis of symmetry. The only difference (besides the shape parameter a) is y0, which can be calculated either from plugging a, b, c and x0 into the standard form, or by using the fact that
y0 = b^2/4a - b^2/2a + c |
Using the second method, I get:
a = 1; b = 2; y0 = -1 + 3 = 2 |
We ought to be able to solve for b and c to put the parabola anywhere on the x,y-plane..
UPDATE:
x0 = -b/2a |
from __future__ import division |
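The full listing is truncated above. A minimal sketch of the quadratic function and the two sets of parabolas described in the post (the b values for a = 2 and a = 3 in the second set are my assumption, chosen so that x0 = -b/2a stays at -1):

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

def quadratic(a, b, c):
    # vertex of y = a*x**2 + b*x + c, plus points with |x - x0| <= 4
    x0 = -b / (2 * a)
    y0 = a * x0 ** 2 + b * x0 + c
    x = np.linspace(x0 - 4, x0 + 4, 201)
    y = a * x ** 2 + b * x + c
    return x0, y0, x, y

params = [(1, 0, 0, 'cyan'), (2, 0, 0, 'blue'), (3, 0, 0, 'purple'),
          (1, 2, 3, 'magenta'), (2, 4, 3, 'salmon'), (3, 6, 3, 'red')]
for a, b, c, color in params:
    x0, y0, x, y = quadratic(a, b, c)
    plt.plot(x, y, color=color)
plt.savefig('parabolas.png')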