Wednesday, December 23, 2009

PyCogent 15: BLAST

I'm exploring PyCogent (docs, download page)---first post here.

If you've done anything in Bioinformatics, you've probably run a BLAST search at the NCBI website.

NCBI also makes the same programs available in a form that can run locally on a variety of platforms (e.g. for OS X). These are particularly easy to use because they are pre-compiled and simply need to be downloaded to use. The usual reason to do this is to run searches against a user-defined database or to run a very large number of searches.

PyCogent includes an Application Controller for BLAST as well as many other programs. The "promise" of PyCogent is that it will provide a standardized interface to the command line version of BLAST. That is, it should be more flexible than just doing it directly (I'll explain the details below):

blastall -i temp/inseqs.fasta -p blastn -m 9 -d ~/Desktop/temp/refseqs.fasta

And it should be simpler than using Python's built-in module subprocess, which is already pretty easy:

import os, subprocess

def blastall(db,fn,ofn,howmany=10):
cmd = 'blastall'
cmd += ' -i ' + fn
cmd += ' -p ' + 'blastn'
cmd += ' -d ' + db
cmd += ' -o ' + ofn
cmd += ' -m 7 '
cmd += ' -b ' + str(howmany) # ALIGNMENTS
p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(, 0)
return ecode

BLAST has recently been updated to provide a new suite of programs (BLAST+), and I'll deal with that issue later. For the moment, let's look at the latest versions of the classic or "legacy" software (v 2.2.22 as of today). You can grab them from here.

The following instructions (like everything on this blog) are for OS X.

While any program can be invoked using its full path, it's helpful to provide a link that is available on your $PATH. I did this in two steps. I have a file in my home directory (a hidden file) named .bash_profile that contains this line:

PATH=$PATH:$HOME/bin/ ;export PATH

And I made symbolic links to two of the executables like this:

ln -s ~/Software/blast/programs/blast-2.2.22/bin/blastall ~/bin/blastall
ln -s ~/Software/blast/programs/blast-2.2.22/bin/formatdb ~/bin/formatdb

Since the ~/bin directory is on my $PATH, this now works from the command line to display the BLAST help menu:

blastall --help

Almost any BLAST run will involve specifying an input file with the query sequence(s), an output file to save the results, and a database file to search against. For protein searches the substitution matrix and its path will also need to be specified. With a one-time search it is easiest to do all of this in the command we provide to execute the program.

But the first required step is to pre-process the database sequences. The program needed (formatdb) is in the folder with the other BLAST programs. A folder on my Desktop contains a single file with some fake database sequences (temp/refseqs.fasta):


It can be formatted by doing this:

formatdb -i temp/refseqs.fasta -p F

The formatdb program will write a logfile (formatdb.log) to the directory from which you execute the above command, or you can specify the location to write it. I can run it with the command formatdb (since it's on my path) and provide two parameters: a relative path to the database, and -p F, which designates that protein = False. This is required since the default is T.

Alternatively, you can use PyCogent (here, from the interpreter):

>>> from import build_blast_db_from_fasta_path
>>> result = build_blast_db_from_fasta_path('temp/refseqs.fasta')

The variable result contains a list of the files that were written (plus the input sequence file).

Command-line BLAST invocation

BLAST can be run quite simply from the command-line. If we put the following sequence in a file in the temp directory (temp/inseqs.fasta):


Then, from the command line:

blastall -i temp/inseqs.fasta -p blastn -m 9 -d ~/Desktop/temp/refseqs.fasta

# BLASTN 2.2.22 [Sep-27-2009]
# Query: s2_like_seq
# Database: /Users/te/Desktop/temp/refseqs.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
s2_like_seq s2 89.66 29 2 1 1 28 1 29 6e-05 26.3
s2_like_seq s3 100.00 12 0 0 1 12 1 12 2e-04 24.3
s2_like_seq s1 100.00 12 0 0 1 12 1 12 2e-04 24.3

Another straightforward way is to use Python's subprocess module. For example, I put the code at the top of the post into a script. Now, just call the function blastall and provide paths to the sequence file, the database, and the file to write results. The -m flag above specifies output format. The help lists more options.

  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
range from 0 to 11

Finally, we can also use PyCogent to do this. The idea is that PyCogent will provide a standardized interface for Application Controllers. Unfortunately, it's not quite there yet. In some more complicated cases, a "convenience" function is often available, as used in the code below.

PyCogent (again, from the interpreter):

>>> from import blast_seqs, Blastall
>>> from cogent import LoadSeqs, DNA
>>> seqs = LoadSeqs(filename='temp/inseqs.fasta', aligned=False)
>>> seq = seqs.getSeq('s2_like_seq')
>>> seq
ByteSequence(TGCAGCT... 28)
>>> params={'-p':'blastn','-n':'F','-m':'9'}
>>> data_path = '~/bin/blast/programs/blast-2.2.22/data'
>>> result = blast_seqs([seq],
... Blastall,
... blast_db = 'temp/refseqs.fasta',
... blast_mat_root = data_path,
... params = params)
>>> data = result['StdOut'].read()
>>> print data
# BLASTN 2.2.22 [Sep-27-2009]
# Query: 1
# Database: temp/refseqs.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
1 s2 89.66 29 2 1 1 28 1 29 6e-05 26.3
1 s3 100.00 12 0 0 1 12 1 12 2e-04 24.3
1 s1 100.00 12 0 0 1 12 1 12 2e-04 24.3

One last point. In the PyCogent example we need a parameter named blast_mat_root. This is the path to the data directory that contains substitution matrices. It isn't actually used (since we're dealing with DNA sequences), but blast_seqs doesn't seem to know that.

It might be nice not to have to specify the path to the database every time we run blast. The suggested way to do this is to provide a file named .ncbirc in my home directory. I didn't find the instructions very clear, but I think I got it figured out. I'll have more to say about that next time.


Greg said...

One note -- you say that:

"The variable result contains a list of the files that were written (plus the input sequence file)."

The 'input sequence file' that is returned is actually the blast database name that would be passed to as 'blast_db' or to the blastall command line application via -d. So, the way I typically use build_blast_db_from_fasta_path is as follows:

from import build_blast_db_from_fasta_path
from import blast_seqs
from cogent.util.misc import remove_files

blast_db, files_to_remove = build_blast_db_from_fasta_path('temp/refseqs.fasta')

result = blast_seqs([seq], Blastall, blast_db = blast_db, blast_mat_root = data_path, params = params)


This will build a temporary database, blast seq against it, and then remove all of the files that formatdb created. Users can optionally leave off the last step if they want to keep the blast database that was created.

telliott99 said...

Thanks, Greg. I should have been more specific, and I missed something too...

What is returned is a tuple whose first element is a string corresponding to the path passed into this function


and whose second element is a list of strings that are the paths to the files written by formatdb. So, for a one-off db, you can clean up by removing all the files at the paths in the list returned as the second element of the tuple.