Monday, May 12, 2008

Using formatdb

There are several reasons to run BLAST locally rather than over the internet. One is speed. However, the size of the non-redundant (nr) database is enormous (zipped files total ≈10 GB). For 1E3 or even 1E4 runs a day, it makes a lot of sense to let them maintain the database and do BLAST over the web. Particularly when one can bundle queries (I don't know the limits for this, but 6 rRNA sequences works fine). It says somewhere in the docs that you can submit a new job once you receive the 'RID.' Hey, just run the job overnight.

A more common use is to search a custom-made database. A database is simply a text file of FASTA formatted sequences. In order to use one, it must first be processed by the formatdb program provided with the BLAST programs. After formatdb runs, it creates three new files: *.nhr, *.nin and *.nsq. The original database file is no longer needed afterward.

The documentation that comes with formatdb looks forbidding. The program is sophisticated, but its use doesn't have to be complicated. I do not use a configuration (.formatdbrc) file. The following program employs a hard-coded path to the binary. One subtle point about file paths: the shell understands '~' to mean the user's home directory, but formatdb does not. You have to use the full path. Here I obtain that using a Foundation function called NSHomeDirectory(). If the program does not run for this reason, the logfile contains no hint about why. It's interesting that the blast programs can handle '~' correctly.

Minimal flags passed to formatdb include:

-i  path to database file
-p  protein? F
-o  parse SeqId and do fancy stuff? F
-l  path to write logfile

Here is a very nice tutorial that covers the whole process of installing BLAST locally.

import os, sys, subprocess
from Foundation import NSHomeDirectory

def run():
    prog = '~/bin/blast'
    prog += '/programs/blast-2.2.18/bin/formatdb'

    home = NSHomeDirectory()

    directory = home + '/Desktop/analysis.seq/db.all/'
    cmd = prog + ' -i' + directory + 'db.txt'
    cmd += ' -p F -o F -l '
    cmd += directory + 'logfile.txt'

    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)

if __name__ == "__main__":
run()

1 comment:

Nicholi said...

Hi, I am a student at the University of Northern Iowa, I like this blog it is clear and to the point. I wish I had found it sooner so that I didn’t need to figure all this out on my own. I do have one question for you, I am trying to run a program called CPTRA, it runs in Python and it requires one; Reference cDNA sequence file, input file to be formatted with formatdb. The problem I have now is formadb outputs 3 to 5 files depending on how you use it, and I don’t know how to (or which one to) in put it back in to CPTRA. Any help would be much appreciated my email is nicholipitra@gmail.com

Also a link to CPTRA if you want to check it out

http://people.tamu.edu/~syuan/cptra/cptra.html

thank you,
nicholi pitra