Wednesday, May 7, 2008

Simple FASTA manipulation

As you know, the FASTA format is a text format for sequences consisting of a single-line description which starts with '>' and is then followed by the sequence. Multiple entries are allowed in a single file. The sequence part of the entry may contain newlines or not. There may be a single newline between FASTA entries or one or more blank lines. For DNA, the commonly used symbols are A, C, G, T and N, but there are other legal characters as well. Object oriented approaches (like Biopython) will turn a DNA sequence into an object, with methods and so on, but I prefer to deal with sequences as simple strings.

When I retrieve a set of sequences from Genbank, entries are separated by a single space. The sequence is upper case with 70 characters per line. This is useful because if we read the data as text, we can get the individual entries by splitting on two consecutive newlines. The following code uses the urllib2 modeule from the standard library, and obtains two sequences from Genbank:

Usually I will then remove all the newlines (with split()) and reformat at the desired line length. I wrote a simple function to do that:

There are certainly more sophisticated ways to handle FASTA format. Biopython has a parser that works like this (after doing from Bio import Fasta):

This morning I came across a library that I didn't know about previously, which handles interaction with NCBI via eutils. I plan to post on my simple-minded code for scripting with eutils, and then later when I have time I'll take a look at the new library and report back. I also need to work on my picture formatting skills, so the code snapshots will all be the same size. :)