When I retrieve a set of sequences from GenBank, entries are separated by a single blank line. The sequence is upper case with 70 characters per line. This is useful because if we read the data as text, we can get the individual entries by splitting on two consecutive newlines. The following code uses the urllib2 module from the standard library, and obtains two sequences from GenBank:
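A minimal sketch of the idea, not necessarily the code from the original listing. It assumes Python 3, where urllib2 was folded into urllib.request, and it assumes the NCBI efetch endpoint with db=nuccore and rettype=fasta. The function names fetch_fasta and split_entries are hypothetical, and the accession numbers in the usage comment are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_fasta(ids):
    """Download FASTA text for a list of accession numbers (needs network)."""
    params = urlencode({
        "db": "nuccore",        # assumption: nucleotide records
        "id": ",".join(ids),    # several ids in one request
        "rettype": "fasta",
        "retmode": "text",
    })
    with urlopen(EFETCH + "?" + params) as response:
        return response.read().decode("utf-8")


def split_entries(text):
    """Split FASTA text into entries on two consecutive newlines."""
    return [e for e in text.strip().split("\n\n") if e.strip()]


# Usage (placeholder ids, substitute real accession numbers):
# entries = split_entries(fetch_fasta(["ACCESSION_1", "ACCESSION_2"]))
```

Keeping the download and the splitting in separate functions means the splitting step can be tried on any saved text without touching the network.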

Usually I will then remove all the newlines (with split() and join()) and reformat at the desired line length. I wrote a simple function to do that:
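One way such a function might look, as a sketch rather than the original listing. The name reformat and the default width of 60 are assumptions. It expects a single FASTA entry whose first line is the title.

```python
def reformat(entry, width=60):
    """Re-wrap the sequence of one FASTA entry at `width` characters per line."""
    # The first line is the title; everything after it is sequence.
    title, _, body = entry.strip().partition("\n")
    # split() with no arguments drops newlines and any other whitespace.
    seq = "".join(body.split())
    # Slice the sequence into fixed-width chunks.
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([title] + chunks)


# Usage: re-wrap a short entry at 3 characters per line.
print(reformat(">x\nACGT\nACGT", width=3))
```

The title line is left untouched, so only the sequence is re-wrapped.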

There are certainly more sophisticated ways to handle FASTA format. Biopython has a parser that works like this (after doing from Bio import Fasta):

This morning I came across a library that I didn't know about previously, which handles interaction with NCBI via eutils. I plan to post on my simple-minded code for scripting with eutils, and then later when I have time I'll take a look at the new library and report back. I also need to work on my picture formatting skills, so the code snapshots will all be the same size. :)