Python for Bioinformatics: Dental project (1)

Wednesday, March 9, 2011

Dental project (1)

I'd like to spend a few posts talking about a project we just completed on surveying the bacterial species present in patients with gingivitis compared to normal controls. The main reason is to exercise QIIME with a different data set, but I'd also like to say a bit about the project.

I checked this morning at PubMed and found the paper has come out:

Olson 2011 PMID 21362199

Click on the link to download, 7.6 MB, that's just about the largest file size for a paper that I ever saw. Opening it up, I see why. It's still in manuscript form, and some of the figures are quite big. To play with this, we'll need to get the sequences. Luckily they were just posted by Genbank the other day.

I wrote a script to grab the sequences in chunks of 40, with a timer to sleep for 10 seconds between requests. The first sign of trouble was here:

HQ895465 HQ895504 1
URL Error
HQ895505 HQ895544 1
..

but eventually, we did another request for this batch which looked like it worked:

HQ895465 HQ895504 2

but actually, the file contains this near the end:

PubseqAccess cmd(id_gi_by_word HQ895505) failed with Cannot get server name from load balancer &apos;PUBSEQ_OS_PUBLIC&apos; : errmsg=&apos;Service not found&apos; HQ895505

and then more of the same. Looking at the sequences, it seems they cut us off with 1000 sequences.. stopping with HQ895464.1

I thought this should be OK. It's very early in the morning, with more than 3 seconds between requests, but apparently we ran up against some kind of limit.

I give the server some time to calm down, (and change the name of the file we've written to), edit the list and try again:

> python fetchSeqs.py 
HQ895465 HQ895504 1
HQ895505 HQ895544 1
HQ895545 HQ895584 1
HQ895585 HQ895588 1

then combine by hand.. Next time we'll take a look at them.

fetchSeqs.py

import urllib2, sys, time
from utils import load_data

ncbi = 'http://eutils.ncbi.nlm.nih.gov'
eutils = ncbi + '/entrez/eutils/'
efetch = 'efetch.fcgi?'

def chunks(L,SZ):
    rL = list()
    while L:
        rL.append(L[:SZ])
        L = L[SZ:]
    return rL

def fetch(L):
    s = eutils + efetch
    s += 'id=' + L
    s += '&db=nucleotide&rettype=fasta&retmode=text'
    try:
        FH = urllib2.urlopen(s)
        data = FH.read()
    except:
        raise ValueError('URL Error')
    if 'NCBI C++' in data:
        raise ValueError('Empty')
    elif not data:
        raise ValueError('Empty')
    return data
    
def run(FH):
    first = 894465
    #first = 895465
    last = 895588
    L = ['HQ' + str(n) for n in range(first,last+1)]
    rL = list()
    L = chunks(L,SZ=40)
    L = [(e,1) for e in L]
    while L:
        sL,n = L.pop(0)
        print sL[0], sL[-1], n
        if n > 3:  continue
        try:
            s = fetch(','.join(sL))
            FH.write(s)
        except ValueError as e:
            print e
            L.append((sL,n+1))
        time.sleep(10)
    
if __name__ == '__main__':
    FH = open('results.txt','w')
    L = run(FH)
    FH.close()