Tuesday, February 22, 2011

Fetching sequences from NCBI

Here's the problem I want to solve:

I need to collect sequences from NCBI for a small database of 16S rRNA gene sequences from bacteria of interest. Many of these are from clonal isolates so the FASTA-formatted sequences have title lines like

>gi|29290057|gb|AY207063.1| Abiotrophia sp. oral clone P4PA_155 P1 16S ribosomal RNA gene, partial sequence

I can parse that title, or I also have the Genbank identifier in a database entry like this:

AY207063.1     Abiotrophia_clone_P4PA_155

I grab the corresponding sequence using PyCogent (as discussed in the post here):

from cogent.db.ncbi import EFetch
ef = EFetch(id='AY207063')
print ef.read()

> python fetch.py 
>gi|29290057|gb|AY207063.1| Abiotrophia sp. oral clone P4PA_155 P1 16S ribosomal RNA gene, partial sequence

However, I also want to get some rRNA sequences from genome projects. In that case the FASTA-formated RefSeq record is huge, and I just want a small part of it.

From the prokaryote genome projects page at NCBI, I click on the RefSeq link for E. coli MG1655, and on the RefSeq page I click on Structural RNAs

I find what I'm looking for halfway down the page:

16S ribosomal RNA of rrnB operon

I copy the link for "DNA region in flatfile format" (the green diamond)


I tried feeding the range information (from and to) to EFetch but got nowhere. And I'm not sure if this is the best way, but here is what I did. I browsed through

class EFetch(UrlGetter) in

class UrlGetter(object) in

I just added a couple of keys to the dict which is used to construct the url. In cogent.utils.ncbi after line 133 I inserted:

        'from', 'to',

Now this works:

from cogent.db.ncbi import EFetch
D = {'id':'NC_000913','from':'4164682','to':'4166223'}
ef = EFetch(**D)
print str(ef) + '\n'
print ef.read()

> python fetch.py 

>gi|49175990:4164682-4166223 Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome

Genbank is smart enough to figure out that the id we sent is not a real gi but refers to a record in the RefSeq database. Rather than using from and to, you'd think it would be OK to use something like:


but I couldn't get it past EFetch, even with URL encodings. Guess I should run this by the PyCogent crew for issues. Maybe they'll want to put it in the toolkit.

No comments: