I checked PubMed this morning and found that the paper has come out:
Olson 2011 PMID 21362199
Click on the link to download it. At 7.6 MB, that's just about the largest file size I've ever seen for a paper. Opening it up, I see why: it's still in manuscript form, and some of the figures are quite big. To play with this, we'll need to get the sequences. Luckily, GenBank posted them just the other day.
I wrote a script to grab the sequences in chunks of 40, with a timer to sleep for 10 seconds between requests. The first sign of trouble was here:
but eventually, we did another request for this batch which looked like it worked:
but actually, the file contains this near the end:
and then more of the same. Looking at the sequences, it seems they cut us off at 1000 sequences, stopping with HQ895464.1.
I thought this would be OK: it's very early in the morning, and there are more than 3 seconds between requests. But apparently we ran up against some kind of limit.
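A quick way to see where a download was cut off is to compare the FASTA headers in the file against the list of accessions that was requested. Here is a minimal sketch of that check. The function names are my own for illustration, and it assumes each accession appears somewhere in its record's header line, which may not hold for every header format.

```python
def headers(text):
    """Return the FASTA header lines (those starting with '>')."""
    return [line for line in text.splitlines() if line.startswith('>')]

def missing_ids(requested, text):
    """Return the requested accessions that appear in no header line.

    Assumption: an accession shows up verbatim in the header of its
    record, e.g. as the first token or between '|' separators.
    """
    joined = '\n'.join(headers(text))
    return [acc for acc in requested if acc not in joined]
```

Reading the downloaded file into a string and calling `len(headers(text))` gives the number of records that actually arrived, and `missing_ids(id_list, text)` gives the list to request again.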
I give the server some time to calm down (and change the name of the file we write to), edit the list, and try again:
then combine the files by hand. Next time we'll take a look at the sequences.
fetchSeqs.py
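What follows is a minimal sketch of a script along these lines, not the exact code used for the run above. It assumes the current NCBI E-utilities `efetch` endpoint and the `nuccore` database. The function names and the `fetch` parameter (which lets you swap in a stand-in fetcher for testing without touching the server) are illustrative choices of mine.

```python
import time
import urllib.parse
import urllib.request

# Assumption: the current E-utilities efetch endpoint.
EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def chunks(items, size=40):
    """Yield successive slices of items, each at most size long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_fasta(ids):
    """Request one batch of accessions from efetch as FASTA text."""
    params = urllib.parse.urlencode({
        'db': 'nuccore',
        'id': ','.join(ids),
        'rettype': 'fasta',
        'retmode': 'text',
    })
    with urllib.request.urlopen(EFETCH_URL + '?' + params) as response:
        return response.read().decode('utf-8')

def fetch_all(ids, out_fn, size=40, delay=10, fetch=fetch_fasta):
    """Fetch ids in batches, writing every batch to out_fn.

    Sleeps for delay seconds between requests (not after the last one)
    and returns the number of requests made.
    """
    batches = list(chunks(ids, size))
    with open(out_fn, 'w') as fh:
        for n, batch in enumerate(batches):
            fh.write(fetch(batch))
            if n < len(batches) - 1:
                time.sleep(delay)
    return len(batches)
```

With a list of accessions in `id_list`, a call like `fetch_all(id_list, 'seqs.fasta')` would make one request per 40 IDs. Given what happened above, it is worth checking the output file for error text afterwards rather than trusting that every request succeeded.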