I need to collect sequences from NCBI for a small database of 16S rRNA gene sequences from bacteria of interest. Many of these are from clonal isolates so the FASTA-formatted sequences have title lines like
I can parse that title, or I also have the GenBank identifier in a database entry like this:
I grab the corresponding sequence using PyCogent (as discussed in the post here):
However, I also want to get some rRNA sequences from genome projects. In that case the FASTA-formatted RefSeq record is huge, and I just want a small part of it.
From the prokaryote genome projects page at NCBI, I click on the RefSeq link for E. coli MG1655, and on the RefSeq page I click on Structural RNAs.
I find what I'm looking for halfway down the page:
16S ribosomal RNA of rrnB operon
I copy the link for "DNA region in flatfile format" (the green diamond)
I tried feeding the range information (from and to) to EFetch but got nowhere. And I'm not sure if this is the best way, but here is what I did. I browsed through
I just added a couple of keys to the dict that is used to construct the URL. In cogent.utils.ncbi, after line 133, I inserted:
Now this works:
GenBank is smart enough to figure out that the ID we sent is not a real GI but refers to a record in the RefSeq database. Rather than using from and to, you'd think it would be OK to use something like:
but I couldn't get it past EFetch, even with URL encodings. Guess I should run this by the PyCogent crew for issues. Maybe they'll want to put it in the toolkit.
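For anyone who wants to try the same thing without patching PyCogent, here is a minimal sketch that builds the EFetch URL directly with the standard library. It is not the patch described above. It assumes the `seq_start` and `seq_stop` parameter names from NCBI's E-utilities documentation rather than `from` and `to`, the helper name `build_efetch_range_url` is made up for illustration, and the coordinates are placeholders, not the real position of the rrnB 16S gene.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities EFetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_range_url(seq_id, start, stop, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for a sub-range of a sequence record.

    Hypothetical helper: start and stop are 1-based, inclusive coordinates,
    passed through as the seq_start / seq_stop query parameters (assumed
    here from the E-utilities documentation).
    """
    if start < 1 or stop < start:
        raise ValueError("need 1 <= start <= stop")
    params = {
        "db": db,
        "id": seq_id,
        "rettype": rettype,
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }
    # urlencode handles any escaping the values need.
    return EFETCH_BASE + "?" + urlencode(params)


# NC_000913 is the RefSeq accession for E. coli K-12 MG1655.
# 1000 and 2500 are placeholder coordinates, not the real gene location.
url = build_efetch_range_url("NC_000913", 1000, 2500)
print(url)
```

Pasting the printed URL into a browser, or opening it with `urllib.request.urlopen`, should be enough to check whether the server honors the range for a given record.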