If you've done anything in Bioinformatics, you've probably run a BLAST search at the NCBI website.
NCBI also makes the same programs available in a form that can run locally on a variety of platforms (e.g. for OS X). These are particularly easy to use because they are pre-compiled and simply need to be downloaded to use. The usual reason to do this is to run searches against a user-defined database or to run a very large number of searches.
PyCogent includes an Application Controller for BLAST as well as many other programs. The "promise" of PyCogent is that it will provide a standardized interface to the command line version of BLAST. That is, it should be more flexible than just doing it directly (I'll explain the details below):
And it should be simpler than using Python's built-in module subprocess, which is already pretty easy:
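The original snippet for this point is not reproduced here. As a rough sketch of what a subprocess-based wrapper could look like, the code below builds a legacy blastall command and runs it. It assumes that blastall is on your $PATH; the function names, the default program (blastn), and the default -m value (8, tabular output) are illustrative choices, not necessarily the ones used in the original script.

```python
import subprocess


def blastall_command(seq_path, db_path, out_path, program="blastn", fmt=8):
    """Build the argument list for a legacy blastall run (illustrative defaults)."""
    return [
        "blastall",
        "-p", program,    # search program, e.g. blastn for DNA against DNA
        "-i", seq_path,   # FASTA file with the query sequence(s)
        "-d", db_path,    # database previously formatted with formatdb
        "-o", out_path,   # file to write the results to
        "-m", str(fmt),   # output format; 8 is tabular (assumed default here)
    ]


def blastall(seq_path, db_path, out_path, **kwargs):
    """Run blastall; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(blastall_command(seq_path, db_path, out_path, **kwargs))
```

With the files used later in this post, a call might look like blastall("temp/inseqs.fasta", "temp/refseqs.fasta", "temp/result.txt"), where the output filename is a made-up example. Keeping the command construction in its own function makes it easy to inspect the exact arguments before anything is executed.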
BLAST has recently been updated to provide a new suite of programs (BLAST+), and I'll deal with that issue later. For the moment, let's look at the latest versions of the classic or "legacy" software (v 2.2.22 as of today). You can grab them from here.
The following instructions (like everything on this blog) are for OS X.
While any program can be invoked using its full path, it's helpful to provide a link that is available on your $PATH. I did this in two steps. I have a file in my home directory (a hidden file) named .bash_profile that contains this line:
PATH=$PATH:$HOME/bin/ ;export PATH
And I made symbolic links to two of the executables like this:
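The exact commands are not preserved here. For readers who would rather script this step, the following is a minimal Python sketch of the same idea. It assumes the two executables are blastall and formatdb (the post does not say which two were linked), and the directory arguments are placeholders for wherever you unpacked the BLAST download and for your own ~/bin.

```python
import os


def link_executables(blast_bin_dir, user_bin_dir, names=("blastall", "formatdb")):
    """Symlink each named executable from blast_bin_dir into user_bin_dir."""
    os.makedirs(user_bin_dir, exist_ok=True)
    links = []
    for name in names:
        target = os.path.join(blast_bin_dir, name)
        link = os.path.join(user_bin_dir, name)
        if not os.path.lexists(link):  # leave any existing file or link alone
            os.symlink(target, link)
        links.append(link)
    return links
```

A call such as link_executables("/path/to/blast/bin", os.path.expanduser("~/bin")) would then make both programs reachable through ~/bin, with the first path being hypothetical.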
Since the ~/bin directory is on my $PATH, this now works from the command line to display the BLAST help menu:
Almost any BLAST run will involve specifying an input file with the query sequence(s), an output file to save the results, and a database file to search against. For protein searches the substitution matrix and its path will also need to be specified. With a one-time search it is easiest to do all of this in the command we provide to execute the program.
But the first required step is to pre-process the database sequences. The program needed (formatdb) is in the folder with the other BLAST programs. A folder on my Desktop contains a single file with some fake database sequences (temp/refseqs.fasta). It can be formatted by doing this:
formatdb -i temp/refseqs.fasta -p F
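If you want to script this step without PyCogent, the same formatdb call can be wrapped with subprocess. This is only a sketch: it assumes formatdb is on your $PATH, and the function names and the optional logfile argument (passed through formatdb's -l option) are my own illustrative choices.

```python
import subprocess


def formatdb_command(fasta_path, protein=False, logfile=None):
    """Build the argument list for formatdb."""
    # -p T means a protein database, -p F a nucleotide one.
    cmd = ["formatdb", "-i", fasta_path, "-p", "T" if protein else "F"]
    if logfile is not None:
        cmd += ["-l", logfile]  # write the log here instead of ./formatdb.log
    return cmd


def formatdb(fasta_path, protein=False, logfile=None):
    """Run formatdb; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(formatdb_command(fasta_path, protein, logfile))
```

Calling formatdb("temp/refseqs.fasta") runs the same command as the line above. Note that this wrapper defaults to a nucleotide database, whereas formatdb itself defaults to protein.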
The formatdb program will write a logfile (formatdb.log) to the directory from which you execute the above command, or you can specify the location to write it. I can run it with the command formatdb (since it's on my path) and provide two parameters: a relative path to the database, and -p F, which designates that protein = False. This is required since the default is T.
Alternatively, you can use PyCogent (here, from the interpreter):
The variable result contains a list of the files that were written (plus the input sequence file).
Command-line BLAST invocation
BLAST can be run quite simply from the command-line. If we put the following sequence in a file in the temp directory (temp/inseqs.fasta):
>s2_like_seq
TGCAGCTTGAGCACAGGTTAGAGCCTTC
Then, from the command line:
Another straightforward way is to use Python's subprocess module. For example, I put the code at the top of the post into a script. Now, just call the function blastall and provide paths to the sequence file, the database, and the file to write results. The -m flag above specifies output format. The help lists more options.
Finally, we can also use PyCogent to do this. The idea is that PyCogent will provide a standardized interface for Application Controllers. Unfortunately, it's not quite there yet. In some more complicated cases, a "convenience" function is often available, as used in the code below.
PyCogent (again, from the interpreter):
One last point. In the PyCogent example we need a parameter named blast_mat_root. This is the path to the data directory that contains substitution matrices. It isn't actually used (since we're dealing with DNA sequences), but blast_seqs doesn't seem to know that.
It might be nice not to have to specify the path to the database every time we run blast. The suggested way to do this is to provide a file named .ncbirc in my home directory. I didn't find the instructions very clear, but I think I got it figured out. I'll have more to say about that next time.