Python for Bioinformatics: Fetching sequences from NCBI

Tuesday, February 22, 2011

Fetching sequences from NCBI

Here's the problem I want to solve:

I need to collect sequences from NCBI for a small database of 16S rRNA gene sequences from bacteria of interest. Many of these are from clonal isolates so the FASTA-formatted sequences have title lines like

>gi|29290057|gb|AY207063.1| Abiotrophia sp. oral clone P4PA_155 P1 16S ribosomal RNA gene, partial sequence

I can parse that title, or I also have the Genbank identifier in a database entry like this:

AY207063.1     Abiotrophia_clone_P4PA_155

I grab the corresponding sequence using PyCogent (as discussed in the post here):

from cogent.db.ncbi import EFetch
ef = EFetch(id='AY207063')
print ef.read()

> python fetch.py 
>gi|29290057|gb|AY207063.1| Abiotrophia sp. oral clone P4PA_155 P1 16S ribosomal RNA gene, partial sequence
GACGAACGCTGGCGGTGTGCCTAATACATGCAAGTCGAACGAACCGCGACTAGGTGCTTGCACTTGGTCA
..

However, I also want to get some rRNA sequences from genome projects. In that case the FASTA-formated RefSeq record is huge, and I just want a small part of it.

From the prokaryote genome projects page at NCBI, I click on the RefSeq link for E. coli MG1655, and on the RefSeq page I click on Structural RNAs

I find what I'm looking for halfway down the page:

16S ribosomal RNA of rrnB operon

I copy the link for "DNA region in flatfile format" (the green diamond)

http://www.ncbi.nlm.nih.gov/nuccore/NC_000913?from=4164682&to=4166223

I tried feeding the range information (from and to) to EFetch but got nowhere. And I'm not sure if this is the best way, but here is what I did. I browsed through

class EFetch(UrlGetter) in
/Library/Python/2.6/site-packages/cogent/db/ncbi.py

class UrlGetter(object) in
/Library/Python/2.6/site-packages/cogent/db/util.py

I just added a couple of keys to the dict which is used to construct the url. In cogent.utils.ncbi after line 133 I inserted:

        'from', 'to',

Now this works:

from cogent.db.ncbi import EFetch
D = {'id':'NC_000913','from':'4164682','to':'4166223'}
ef = EFetch(**D)
print str(ef) + '\n'
print ef.read()

> python fetch.py 
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmax=100&from=4164682&retmode=text&tool=PyCogent&db=nucleotide&email=Michael.Robeson%40colorado.edu&to=4166223&rettype=fasta&retstart=0&id=NC_000913

>gi|49175990:4164682-4166223 Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGT
..

Genbank is smart enough to figure out that the id we sent is not a real gi but refers to a record in the RefSeq database. Rather than using from and to, you'd think it would be OK to use something like:

NC_000913:4164682-4166223

but I couldn't get it past EFetch, even with URL encodings. Guess I should run this by the PyCogent crew for issues. Maybe they'll want to put it in the toolkit.