Saturday, December 26, 2009

Pycogent 16: BLAST parsers

I'm exploring PyCogent (docs, download page)---first post here.

Continuing with the topic from the other day, processing of BLAST output, I was confused about usage of the blast parsers. As defined in cogent.parse.blast:

def MinimalBlastParser9(lines, 

and in cogent.parse.blast_xml:

def MinimalBlastParser7(lines, 

The first argument is named 'lines', but despite that (as suggested in comments), these functions are looking for a file object. That is, what we're going to do (after we do the BLAST, below) is this:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

L will then contain one item for each query sequence.

That item is a tuple with two elements
• a dictionary with meta-data
• a list of sub-lists

Each sub-list contains the data for an Hsp.

I'll do the test with a different db containing Achromobacter 16S rRNA seqs.


You could grab these as described here. First, we have to format the db for BLAST.

from import build_blast_db_from_fasta_path
result = build_blast_db_from_fasta_path('temp/achromo.fasta')

blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 9 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.txt

This data is actually simple enough that you don't really need a parser. I will do this analysis from the interpreter, but I'm not showing the interpreter prompt '>>> ').

from cogent.parse.blast import MinimalBlastParser9
FH = open('blast_test.txt', 'r')
data = FH.readlines()
print data[3].split()[:6]
['#', 'Fields:', 'Query', 'id,', 'Subject', 'id,']
for line in data[4:10]:
print line.split()[:4]
['AF411020.1', 'AF411020.1', '100.00', '630']
['AF411020.1', 'AF411019.1', '99.84', '630']
['AF411020.1', 'DQ450530.1', '99.68', '630']
['AF411020.1', 'EU373389.1', '99.37', '630']
['AF411020.1', 'AJ278451.1', '99.21', '630']
['AF411020.1', 'EF672258.1', '99.05', '630']

We parse as follows:

from cogent.parse.blast import MinimalBlastParser9
blastfile = open('blast_test.txt', 'r')
blast_results = MinimalBlastParser9(blastfile)
L = list(blast_results)

for t in L:
metaD, subL = t
fields = metaD['FIELDS'].split(',')
fields = [f.strip() for f in fields]
for e in subL:
for k,v in zip(fields,e):
print k.ljust(25), v

Data for the first hit looks like this:

Query id                  AF411020.1
Subject id AF411020.1
% identity 100.00
alignment length 630
mismatches 0
gap openings 0
q. start 1
q. end 630
s. start 1
s. end 630
e-value 0.0
bit score 1249

Note that this format does not contain the aligned sequences.

Now, let's do it with XML-formatted ouptut:

blastall \
-i ~/Desktop/temp/achromo_seq.fasta -p blastn -m 7 \
-d ~/Desktop/temp/achromo.fasta \
-o ~/Desktop/blast_test.xml

from cogent.parse.blast_xml import MinimalBlastParser7
blastfile = open('blast_test.xml', 'r')
blast_results = MinimalBlastParser7(blastfile)
L = list(blast_results)

Again, L contains one item for each query sequence. That item is a tuple with two elements

• a dictionary with meta-data
• a list of sub-lists

Each sub-list is the data for an Hsp, whose first element is a list of keys. This is a slightly different arrangement than for the first example.

blast_stuff, subL = L[0]
kL = subL[0]
hit1 = subL[1]
for k,v in zip(kL,hit1):
if 'ALIGN' in k: continue
print k.ljust(20), v

QUERY ID             1
SUBJECT_ID lcl|AF411020.1
HIT_DEF No definition line found
BIT_SCORE 1249.38

print hit1[kL.index(k)][:40]


And while the BLAST parsers have an input argument labeled data but want a file object, the Fasta parser has input arg labeled infile and wants a string! (and note: you must do readlines() rather than read()).

from cogent.parse.fasta import MinimalFastaParser
path = 'temp/achromo.fasta'
FH = open(path)
data = FH.readlines()
p = MinimalFastaParser(data)
for e in p:
print e[0]
print e[1][:40]


Thursday, December 24, 2009

Processing BLAST output

I was hoping to figure out how to use PyCogent's parsers to handle the output from BLAST, but I haven't succeeded at that yet. So, before I leave this topic (at least for a while) I thought I would show again the solution that I've used for my own projects. I talked about this before here. For starters, I think we want to use XML, since NCBI may potentially change other formats at will.

You can view a BLAST search as having up to 3 levels of elements:

• multiple sequences that are in the same query file
• multiple hits, sequences in the database that match
• multiple HSPs (high-scoring segment pairs), within each hit

Here is a graphic from the old post which shows this reflected in the structure of an XML file.

An HSP has a query sequence (qseq), hit sequence (hseq) and midline--- '|' characters showing matches or a space for no match. For what follows, I'll consider just the simpler case where only a single query has been input to BLAST.

Using the setup from last time (here and here), I do this from the command line:

~/Software/blast/programs/blast-2.2.22/bin/blastall \
-i ~/Desktop/temp/inseqs.fasta -p blastn -m 7 \
-d ~/Desktop/temp/refseqs.fasta \
-o ~/Desktop/blast_test.xml

Now we want to parse the results in blast_test.xml. The first section is the parsing code, and the second does the printing.

import xml.etree.ElementTree as ET

def parseBLASTIteration(iteration, howmany=3):
hitL = list()
for hit in iteration.findall('Iteration_hits/Hit')[:howmany]:
hitD = dict()
for k in ['Hit_id','Hit_def','Hit_accession']:
hitD[k] = hit.findtext(k)
hitD['hsps'] = list()
for hsp in hit.findall('Hit_hsps'):
hspD = dict()
hspD['score'] = hsp.findtext('Hsp/Hsp_score')
hspD['evalue'] = hsp.findtext('Hsp/Hsp_evalue')
hspD['identity'] = hsp.findtext('Hsp/Hsp_identity')
hspD['gaps'] = hsp.findtext('Hsp/Hsp_gaps')
hspD['length'] = hsp.findtext('Hsp/Hsp_align-len')
hspD['query'] = hsp.findtext('Hsp/Hsp_qseq')
hspD['midline'] = hsp.findtext('Hsp/Hsp_midline')
hspD['hitseq'] = hsp.findtext('Hsp/Hsp_hseq')
identity = int(hspD['identity'])
length = int(hspD['length'])
hspD['%identity'] = identity*100.0/length
except ZeroDivisionError:
hspD['%identity'] = 'error'
return hitL

def parseSingleIteration(tree,howmany=3):
iteration = tree.find('BlastOutput_iterations/Iteration')
hitL = parseBLASTIteration(iteration,howmany)
return hitL

def showHitList(hitL,withaccession=True):
for j,hitD in enumerate(hitL):
print 'hit #', j+1
for k in ['Hit_id','Hit_def','Hit_accession']:
print k, hitD[k]
if withaccession:
print hitD['Hit_accession'].ljust(10),
hspL = hitD['hsps']
for hspD in hspL[:1]:
print hspD['identity'] + '/' + hspD['length'],
print ('%3.2f' % hspD['%identity']).rjust(7)

def printHspAlignment(hspD):
line_length = 60
query = hspD['query']
midline = hspD['midline']
hitseq = hspD['hitseq']
for i in range(0,len(query),line_length):
print query[i:i+line_length]
print midline[i:i+line_length]
print hitseq[i:i+line_length]

fn = 'blast_test.xml'
tree = ET.parse(fn)
hitL = parseSingleIteration(tree)

Here is the output:

$ python 
hit # 1
Hit_id lcl|s2
Hit_def No definition line found
Hit_accession s2
s2 26/29 89.66
||||||||||| |||||| |||||||||

hit # 2
Hit_id lcl|s3
Hit_def No definition line found
Hit_accession s3
s3 12/12 100.00

hit # 3
Hit_id lcl|s1
Hit_def No definition line found
Hit_accession s1
s1 12/12 100.00

Wednesday, December 23, 2009

BLAST: .ncbirc file

Continuing from last time, here is a bit about the .ncbirc file used for BLAST.

First, as long as the database has been pre-formatted for BLAST, the .ncbirc file is not strictly necessary. For example, having pre-processed a FASTA-formatted database file, I can run blastall specifying full paths to the executable, input file, and database:

~/Software/blast/programs/blast-2.2.22/bin/blastall \
-i ~/Desktop/temp/inseqs.fasta -p blastn \
-d ~/Desktop/temp/refseqs.fasta

The .ncbirc file is supposed to make things easier. According to the docs (or see the documentation that comes with the downloaded executable in blast-2.2.22/doc/blast.html):

To ensure the smooth execution of blast programs, we should set up a BLAST configuration file named .ncbirc to provide the path information for the data and db directories. If we place the blast-#.#.# directory under the home directory of j_smith, we can specify the path to data and db directories using lines below.



This had me confused for quite a while. In the first place, they do not mean that I should literally do DATA=/home/te... Rather, the path should be something like DATA=/Users/te... under OS X. Second, they do not mean that the db or data directory should be directly under my home directory (or even directly in the blast-#.#.# directory).

Third, the path should be the path to the directory containing the database, but should not include the name of the database file itself. Using .ncbirc does not relieve you of the need to specify a database file (except in the case where you want the default which is the nr database).

I put the following into .ncbirc in my home directory


Then I did:

blastall -i temp/inseqs.fasta -p blastn -d refseqs.fasta

This works! However, this shorthand for the home directory does not work:


And all of this can be done without a .ncbirc file by manipulating environment variables. For example:

rm ~/.ncbirc
export BLASTDB=~/Desktop/temp
export BLASTMAT=~/Software/blast/programs/blast-2.2.22/data

And this works:

blastall -i temp/inseqs.fasta -p blastn -d refseqs.fasta

PyCogent 15: BLAST

I'm exploring PyCogent (docs, download page)---first post here.

If you've done anything in Bioinformatics, you've probably run a BLAST search at the NCBI website.

NCBI also makes the same programs available in a form that can run locally on a variety of platforms (e.g. for OS X). These are particularly easy to use because they are pre-compiled and simply need to be downloaded to use. The usual reason to do this is to run searches against a user-defined database or to run a very large number of searches.

PyCogent includes an Application Controller for BLAST as well as many other programs. The "promise" of PyCogent is that it will provide a standardized interface to the command line version of BLAST. That is, it should be more flexible than just doing it directly (I'll explain the details below):

blastall -i temp/inseqs.fasta -p blastn -m 9 -d ~/Desktop/temp/refseqs.fasta

And it should be simpler than using Python's built-in module subprocess, which is already pretty easy:

import os, subprocess

def blastall(db,fn,ofn,howmany=10):
cmd = 'blastall'
cmd += ' -i ' + fn
cmd += ' -p ' + 'blastn'
cmd += ' -d ' + db
cmd += ' -o ' + ofn
cmd += ' -m 7 '
cmd += ' -b ' + str(howmany) # ALIGNMENTS
p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(, 0)
return ecode

BLAST has recently been updated to provide a new suite of programs (BLAST+), and I'll deal with that issue later. For the moment, let's look at the latest versions of the classic or "legacy" software (v 2.2.22 as of today). You can grab them from here.

The following instructions (like everything on this blog) are for OS X.

While any program can be invoked using its full path, it's helpful to provide a link that is available on your $PATH. I did this in two steps. I have a file in my home directory (a hidden file) named .bash_profile that contains this line:

PATH=$PATH:$HOME/bin/ ;export PATH

And I made symbolic links to two of the executables like this:

ln -s ~/Software/blast/programs/blast-2.2.22/bin/blastall ~/bin/blastall
ln -s ~/Software/blast/programs/blast-2.2.22/bin/formatdb ~/bin/formatdb

Since the ~/bin directory is on my $PATH, this now works from the command line to display the BLAST help menu:

blastall --help

Almost any BLAST run will involve specifying an input file with the query sequence(s), an output file to save the results, and a database file to search against. For protein searches the substitution matrix and its path will also need to be specified. With a one-time search it is easiest to do all of this in the command we provide to execute the program.

But the first required step is to pre-process the database sequences. The program needed (formatdb) is in the folder with the other BLAST programs. A folder on my Desktop contains a single file with some fake database sequences (temp/refseqs.fasta):


It can be formatted by doing this:

formatdb -i temp/refseqs.fasta -p F

The formatdb program will write a logfile (formatdb.log) to the directory from which you execute the above command, or you can specify the location to write it. I can run it with the command formatdb (since it's on my path) and provide two parameters: a relative path to the database, and -p F, which designates that protein = False. This is required since the default is T.

Alternatively, you can use PyCogent (here, from the interpreter):

>>> from import build_blast_db_from_fasta_path
>>> result = build_blast_db_from_fasta_path('temp/refseqs.fasta')

The variable result contains a list of the files that were written (plus the input sequence file).

Command-line BLAST invocation

BLAST can be run quite simply from the command-line. If we put the following sequence in a file in the temp directory (temp/inseqs.fasta):


Then, from the command line:

blastall -i temp/inseqs.fasta -p blastn -m 9 -d ~/Desktop/temp/refseqs.fasta

# BLASTN 2.2.22 [Sep-27-2009]
# Query: s2_like_seq
# Database: /Users/te/Desktop/temp/refseqs.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
s2_like_seq s2 89.66 29 2 1 1 28 1 29 6e-05 26.3
s2_like_seq s3 100.00 12 0 0 1 12 1 12 2e-04 24.3
s2_like_seq s1 100.00 12 0 0 1 12 1 12 2e-04 24.3

Another straightforward way is to use Python's subprocess module. For example, I put the code at the top of the post into a script. Now, just call the function blastall and provide paths to the sequence file, the database, and the file to write results. The -m flag above specifies output format. The help lists more options.

  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
range from 0 to 11

Finally, we can also use PyCogent to do this. The idea is that PyCogent will provide a standardized interface for Application Controllers. Unfortunately, it's not quite there yet. In some more complicated cases, a "convenience" function is often available, as used in the code below.

PyCogent (again, from the interpreter):

>>> from import blast_seqs, Blastall
>>> from cogent import LoadSeqs, DNA
>>> seqs = LoadSeqs(filename='temp/inseqs.fasta', aligned=False)
>>> seq = seqs.getSeq('s2_like_seq')
>>> seq
ByteSequence(TGCAGCT... 28)
>>> params={'-p':'blastn','-n':'F','-m':'9'}
>>> data_path = '~/bin/blast/programs/blast-2.2.22/data'
>>> result = blast_seqs([seq],
... Blastall,
... blast_db = 'temp/refseqs.fasta',
... blast_mat_root = data_path,
... params = params)
>>> data = result['StdOut'].read()
>>> print data
# BLASTN 2.2.22 [Sep-27-2009]
# Query: 1
# Database: temp/refseqs.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
1 s2 89.66 29 2 1 1 28 1 29 6e-05 26.3
1 s3 100.00 12 0 0 1 12 1 12 2e-04 24.3
1 s1 100.00 12 0 0 1 12 1 12 2e-04 24.3

One last point. In the PyCogent example we need a parameter named blast_mat_root. This is the path to the data directory that contains substitution matrices. It isn't actually used (since we're dealing with DNA sequences), but blast_seqs doesn't seem to know that.

It might be nice not to have to specify the path to the database every time we run blast. The suggested way to do this is to provide a file named .ncbirc in my home directory. I didn't find the instructions very clear, but I think I got it figured out. I'll have more to say about that next time.

Thursday, December 17, 2009

Matplotlib in OS X: Heat Maps

I've been exploring matplotlib. It's complex enough that I probably need to get a book. I decided to revisit the problem of producing heat maps. In the last couple of posts (here and here), I mentioned a couple of methods that might be used: (scatter, pcolormesh, and pcolor).

But, today I'm going to use matplotlib.patches.Rectangle. Rather than post the code (140 lines in two scripts), I made a zipped Archive that includes:

• a directory containing three data files
(counts, topo.colors, colors for the rownames)
• and
• example.pdf

Above is a part of the figure the code draws The whole thing is up on Dropbox here.

The code to draw the boxes is:

    p = Rectangle((x,y),width=dx,height=dy,
if n == 0: continue
t = ax.text(x+dx/2,y+dy/2,str(n),

It took significantly less time to write this than the Cocoa app that does the same thing, and I think it gains a lot in flexibility.

Here is an example in black and white. It uses the result from binCounts() as the color (i.e. gray).

Tuesday, December 15, 2009

Matplotlib: topo.colors

I looked around a bit on the web for topo colors for matplotlib, but didn't see anything. So, I used my new understanding of how to make a LinearSegmentedColormap, and the previous analysis of R's topo.colors from here. Reprinting the figure from that post

we see how the RGB components vary individually over the course of the range 0-1000 (or 0.000 to 1.000). Examining the values for the colors obtained from R at those points (in R do: topo.colors(1000), I estimated these values:

1    4c00ff
84 0000ff
334 00e5ff
335 00ff4d
418 00ff00
667 e6ff00
668 ffff00
850 ffdd62
1000 ffe0b3

I converted the hex values for the colors at the breakpoints to fractions of 256 in Python:

L = ['4c','b3','dd','e0','e5']
>>> for e in L:
... print e, int(e,16), int(e,16)*1.0/256
4c 76 0.296875
b3 179 0.69921875
dd 221 0.86328125
e0 224 0.875
e5 229 0.89453125

And that allowed me to construct the Colormap. In filling out the map, I came to appreciate the unusual data structure. It's natural to extend something like this:

x y0 y1
x y0

by adding the next y1 to the middle row (the value of the color at the beginning of the interval), and the next y0 to the last row (the value at the end of the interval). Here's what I got. It looks pretty good to me:

Here's the code:

import matplotlib.pyplot as plt
import matplotlib
import numpy as np

# topo colors
r = ((0.000, 0.000, 0.297),
(0.084, 0.000, 0.000),
(0.418, 0.000, 0.000),
(0.667, 0.895, 0.895),
(0.668, 1.000, 1.000),
(1.000, 1.000, 1.000))

g = ((0.000, 0.000, 0.000),
(0.084, 0.000, 0.000),
(0.334, 0.895, 0.895),
(0.335, 1.000, 1.000),
(0.667, 1.000, 1.000),
(0.850, 0.863, 0.863),
(1.000, 0.875, 1.000))

b = ((0.000, 0.000, 1.000),
(0.334, 1.000, 1.000),
(0.335, 0.297, 0.297),
(0.418, 0.000, 0.000),
(0.667, 0.000, 0.000),
(1.000, 0.700, 1.000))

colors = {'red':r, 'green':g, 'blue':b}

f = matplotlib.colors.LinearSegmentedColormap
m = f('my_color_map', colors, 256)
A = np.array(range(100))
A.shape = (10,10)

Matplotlib in OS X (4)

While hunting around for colors for heatmaps, I came across the matplotlib class matplotlib.colors.LinearSegmentedColormap. The way the colors are specified seemed a bit strange at first, but then it became clear.

The first thing is that, naturally, the red, green and blue components are handled separately. They may vary independently and don't even have to be split up into the same number of "intervals." Color values are specified corresponding to the numerical endpoints of each interval, and when it is desired to encode a number as a color, this is done by linear interpolation. Intervals and colors are specified by values between 0 and 1.

An example I found here splits the interval 0 to 1 into two parts for all three colors. The red values come from this array:

r = ((0.0, 0.0, 0.0), 
(0.5, 1.0, 0.7),
(1.0, 1.0, 1.0))

What is going on is that we have split the interval into two parts: 0 to 0.5 and 0.5 to 1.0. These are the entries [0][0], [1][0] and [2][0]. The manual calls these x. From there:

row i:   x  y0  y1
row i+1: x y0 y1
row i+2: x y0 y1

Suppose we encounter a number that lies between the x in row i and the x in row i+1. We grab the indicated values (y1 from row i, and y0 from row i+1), and just interpolate.

So we have 7 values of interest for this set of two intervals:

• the beginning and end of each interval
(where the beginning of the second equals the end of the first)

• red at the beginning and end of interval 1

• red at the beginning and end of interval 2

Two values are never used---y0 from the first row and y1 from the last. Check out the second link for lots more examples.

import matplotlib.pyplot as plt
import matplotlib
import numpy as np

r = ((0.0, 0.0, 0.0),
(0.5, 1.0, 0.7),
(1.0, 1.0, 1.0))

g = ((0.0, 0.0, 0.0),
(0.5, 1.0, 0.0),
(1.0, 1.0, 1.0))

b = ((0.0, 0.0, 0.0),
(0.5, 1.0, 0.0),
(1.0, 0.5, 1.0))

colors = {'red':r, 'green':g, 'blue':b}
f = matplotlib.colors.LinearSegmentedColormap
m = f('my_color_map', colors, 256)

Matplotlib in OS X (3)

I've been playing with matplotlib, and I have to say that the figures it produces are beautiful. Although the documentation is extensive, I do find it hard to navigate to the right pages, or find helpful examples by direct search. But of course, GIYF.

Here is a proof of principle for a heat map in a few short lines of code. I still have to work on placing text, getting the right colors, etc. The fact that it's all Python is terrific. We use a scatterplot and adjust the size of the plot symbol (a square) to be as large as we need.

Probably it's obvious, but the call plt.axis('equal') is crucial because by default the axes are not typically equal and the "squares" are therefore not square. The call yL = sum(yL,[]) is a clever trick I found on Stack Overflow for flattening a list.

[UPDATE: As is not uncommon, I found a simpler API after making this post. The result is shown below. The code is the second of the two samples.]

import matplotlib.pyplot as plt
import random

xL = range(1,11) * 10
yL = [[j]*10 for j in range(1,11)]
yL = sum(yL,[])

R = xrange(100000)
cL = [random.choice(R) for i in range(len(xL))]

fig = plt.figure()
ax = fig.add_subplot(111,axisbg='0.9') # light gray
ax.scatter(xL, yL,c=cL,


import matplotlib.pyplot as plt
import numpy as np
import random

R = xrange(10000)
data = [random.choice(R) for i in range(100)]
data = np.array(data)
data.shape = (10,10)

fig = plt.figure()

Monday, December 14, 2009

Matplotlib in OS X (2)

Now that I have a working intall of matplotlib, I'm going to explore its capabilities to replace R, at least for plotting stuff. The more heavy duty components of R will still be irreplaceable.

The matplotlib example for histograms is here. It's not very intuitive, at least at first sight. I took nearly everything in the example code and wrapped it in a function with a silly name and placed it in a module called H. If you want to follow this example it should be simple to figure out what I did.

[UPDATE: Although this works well, and is said to be more efficient, there is a simpler API. See example here]

The script that connects my data with the plot code is just a couple of lines:

import random
from random import gauss
import matplotlib.pyplot as plt
from H import doWeirdStuffToMakeAHistogram

mu = 10
sigma = 3
L = [gauss(mu,sigma) for i in range(1000)]

Sunday, December 13, 2009

Matplotlib in OS X (1)

Although matplotlib is a sophisticated approach to plotting and it uses Python, I have used R instead for two reasons: Bioconductor uses R, and I couldn't get matplotlib installed properly. Since PyCogent uses matplotlib and I've got it working now, I thought I'd take a look.

The immediate goal is to get some numbers to use to work up a simple example of Principal Coordinates Analysis with PyCogent (and try to remember the difference with Principal Component Analysis). According to the abstract of this book chapter:

The problem is that PCA is based on the correlation or covariance coefficient, and this may not always be the most appropriate measure of association. Principal coordinate analysis (PCoA) is a method that, just like PCA, is based on an eigenvalue equation, but it can use any measure of association (Chapter 10). Just like PCA, the axes are plotted against each other in a Euclidean space, but the PCoA does not produce a biplot (a joint plot of the variables and observations).

This should all be similar to what we did previously in R. Here is a Python script, in two parts. The first part generates generates the points plotted in red as separate lists of x and y values, and makes a scatterplot. 's' is for the size of the plot character (weirdly, 'size' doesn't work).

The second part uses the array manipulation from numpy to rotate these points. The transformation matrix is t, and the multiplication is achieved by dot(t,a). The matplotlib OS X GUI doesn't work properly, so I just save the output as a PDF. (In contrast, the R OS X GUI is beautiful and always works perfectly for me. I think the matplotlib developers must be Windows or Unix guys. That would also explain why the Makefile is out of date).

The figure is a little off because I haven't figured out how to set the plot limits yet.

Here's the code:

import matplotlib.pyplot as plt
from numpy import *
from math import sqrt
import random

R = range(-100,100)
def fx(): return random.choice(R)
r = range(-10,10)
def fy(): return random.choice(r)

xL = [fx() for i in range(30)]
yL = [fy() for i in range(30)]

fig = plt.figure()
ax = fig.add_subplot(111)


a = xL+yL
a = array(xL + yL)
a.shape = (2,len(xL))

z = 1.0/sqrt(2)
t = [[z,-z],[z,z]]
t = array(t)
xL,yL = dot(t,a)

ax2 = fig.add_subplot(111)

Saturday, December 12, 2009

Duly quoted

I found a thread discussing Python versus R. If you, like me, have used both R and Python, I think you'll find it interesting. My favorite quote:
The difference between a general purpose language and a DSL is the difference between a cow and a carton of milk. Sure, the cow has lots more features - it can make lots more milk and fertalizer to boot, and it can even pull a wagon if you are so inclined. But if all you want is a glass of milk and you get a cow, then you've got an oversized bovine crapping in your office.

Links and lists

I got sucked into Project Euler for a while. I was even in the top 1000 (for about a week). Not bragging or anything…

But the problems got too speed-oriented, and I haven't been back in about 2 years. In that spirit (the top 1000 spirit), when I google "python bioinformatics", there is a link to this blog at # 22.


Sorry. I watched Full Metal Jacket the other night.

That's not good enough, people.

Still, the guy with the top link wrote a book. Unfortunately, he gave up on the blog. Here is an moderately interesting post about the end of things. Love the graphic.

Here is a poster at # 4. He likes PyCogent. Unfortunately, he works for a company who want to commercialize phylogenetics software. Not crazy about that.

Do you know anybody I can suck up to avoid the prospect of these words vanishing into the ether? Oh...right, they're gone! Except for Google Data.


As you can tell from recent posts, I'm exploring PyCogent (docs). The most important reason that I'm interested in it is that it has powerful methods for doing phylogenetic analysis. Also, I have a good feeling about the people involved. I hope it has staying power. I never had that feeling about BioPython.

So I've started looking at the Cookbook and other examples. As usual, the docs don't seem too complete. (Nobody wants to write documentation). Still, I think it could help me get better at phylogenetics to go through this project in more detail. To help me understand, I needed a way to look at the organization of a documentation file at each different levels.

The levels are keyed by symbols in the text that occur on the line following a header, like this.


Python is a great tool for problems like this. I wrote a short script to make outlines from rst files. It looks cleaner if you use spaces instead of the control characters, but this helps me see what's going on. Here is some output:

file: Alignments.rst
* Collections and Alignments
= Sequences
- Basic DnaSequence objects
^ Constructing a SequenceCollection or Alignment object from strings
^ Loading a collection or alignment from a file
^ Converting a SequenceCollection to FASTA format
^ The elements of a collection or alignment
^ Access individual sequences
^ Keeping a subset of sequences from the alignment
^ Parsing files with many sequences
^ Loading protein sequences in a Phylip file
^ Loading FASTA sequences from an open file or list of lines
^ Loading DNA sequences from a GenBank file
- Alignments
^ Creating an Alignment object from a SequenceCollection
^ Converting an alignment to FASTA format
^ Converting an alignment into Phylip format
^ Converting an alignment to a list of strings
- Slicing an alignment
^ By rows (sequences)
^ Getting a single column from an alignment
^ Getting a region of contiguous columns
^ Getting codon 3rd positions from an alignment
^ Filtering positions
^ Filtering sequences
- Motifs
^ Computing motif probabilities from an alignment
^ Obtaining one column from a slice of an alignment
^ Filtering a single column for a character
^ Calculating gap fractions for each column in an alignment
^ Getting all variable positions from an alignment
^ Getting all variable codons from an alignment
^ Remove all gaps from an alignment in FASTA format
^ Getting the third sequence from an Alignment as a Sequence object
^ Getting 3rd positions from codons
^ Getting 1st and 2nd positions from codons
= Trees
- Selecting subtrees
- Drawing trees

And here is the code:

import sys,os
fn = sys.argv[1]
L = [fn]
except IndexError:
dL = os.listdir(os.getcwd())
L = [fn for fn in dL if fn.endswith('.rst')]

for fn in L:
print 'file:', fn
FH = open(fn,'r')
data ='\n')

targets = '*=-^"+'
results = list()
def f(c):
i = targets.index(c)
i += 1
return c + ((2*i-1) * ' ')

for i,line in enumerate(data[:-1]):
next = data[i+1]
if not next: continue
c = next[0]
if c in targets and next[1] in targets:
results.append(f(c) + line)
print '\n'.join(results)


Well, I finally bought some music from iTunes. I'm probably the last person on the planet to do so, but that's OK. My 13-year old son can't handle the apparent contradiction---"Whoooah, Dad." Shock and awe.

There's couple of things I like:

• they ditched the DRM, so I copied my whole library including the new songs over to my new OS X Server box, wireless. Very cool. Guess I need some speakers (and a solution to putting them in the TV room / Kitchen).

• The audio quality is adequate with AAC-encoding. I think it's not as clean as AIFF, but it works for me.

• I can get my favorite old tunes---you know, the ones I already have (or had) on vinyl.

So, being too lame to actually post an audio link (how?) here are my two favorite Jerry Garcia guitar songs. They do not come from his work with the Dead, but a concert with Merl Saunders at Keystone.

Positively 4th Street
The Harder They Come

We're rockin' out here. Let's see, favorite guitarists?

Jimi Hendrix
Eric Clapton
Jimmy Page
Jerry Garcia
Carlos Santana
Mark Knopfler
Larry Carlton
Lee Ritenour
David Gilmour
Frank Zappa

Though I like I like the top part of the Rolling Stone list, anybody who would rank Keith Richard ahead of Mick Taylor is out of his tree. Just saying…

Pecan Tart

It's not much to do with bioinformatics, but I don't want to set up a different blog or there'll be two places to search.

Aunt Kate's Praline Pecan Tart

375°F oven

1/2 C butter, as cold as practical
1/3 C sugar
2 egg yolks
1 1/4 C flour

Beat butter and sugar with electric beater until mixed a bit, then beat in egg yolks. Add flour and beat until dough is just formed. (Don't let the butter warm up too much). Working quickly, press into bottom and sides of greased 10 inch tart pan with removable bottom. Put pan on cookie sheet on foil; bake at 375 deg for 15 min or until cooked and slightly brown. Cool.

2 C coarsely broken pecans
2/3 C butter, room temp
1 1/2 C light brown sugar
2/3 C dark corn syrup
6 T cream (light or whipping, not heavy)

Spread pecans over bottom of crust. Melt butter in a saucepan, add the sugar and corn syrup. Make sure butter mixture is just melted and not too hot. Add the cream. Mix well and bring to a boil, then turn down heat and simmer for 2 min. Pour over tart. Bake at 375 deg for 12-15 min until bubbling briskly.

Cool slightly. Remove fluted edge. Can store at room temp 1 day. Can be frozen.

Incredible. Very rich.

Thursday, December 10, 2009

PyCogent 14: FastTree

I'm exploring PyCogent (docs, download page)---first post here.

In browsing the PyCogent docs, I came upon something about an Application Controller for FastTree. Since I didn't know what that was, I checked it out here. It looks very interesting. I downloaded the source and built it with this line (from their documentation):

gcc -lm -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c

Simple enough. Here is a script that exercises it. I start with 20 16S rRNA sequences, some of them are experimentally determined (with prefix DA) and the rest are database sequences. You'll have to substitute your own sequences for this file. We load the sequences, align them with muscle, make a tree with fasttree, and plot the result with matplotlib. It runs in about 15 seconds. That is very fast.

This will probably be my last post about PyCogent for a while. I'm still working on it, but it's going to take time. Also, I got a new machine to play with: a mini running OS X Server.

import sys
from cogent import LoadSeqs,DNA
from import fasttree,muscle
from cogent.draw import dendrogram

def get_aln(fn):
f = muscle.align_unaligned_seqs
seqs = LoadSeqs(filename=fn,
print seqs.getNumSeqs()
return f(seqs,DNA)

def get_color(node):
if node.isTip():
name = node.getTipNames()[0]
if name[:2] == 'DA': return 'red'
return 'blue'
return 'black'

def get_tree(aln):
f = fasttree.build_tree_from_alignment
return f(aln,moltype=DNA)

def plot_tree(tr):
np = dendrogram.ContemporaneousDendrogram(tr)

fn = 'capno1.fasta'
aln = get_aln(fn)
tr = get_tree(aln)

PyCogent 13: Tree with colored edges

I'm exploring PyCogent (docs, download page)---first post here.

Continuing from the last post, we use PyCogent to plot the tree. By looking at the source code, I figured out how to color the edges, but I haven't seen how to color the tip labels. It would also be useful to be able to edit the labels.

from cogent import LoadTree
from cogent.draw import dendrogram

fn = 'temp.tree'
t = LoadTree(fn)

def f(node):
L = ['DQ450530','AJ278451','AF411020']
if node.isTip():
name = node.getTipNames()[0]
for e in L:
if e in name: return 'blue'
return 'red'
return 'black'

height, width = 500,500
np = dendrogram.ContemporaneousDendrogram(t)

PyCogent 12: simple Phylogenetic Tree

I'm exploring PyCogent (docs, download page)---first post here.

I want to use the sequences we downloaded last time to make a simple phylogenetic tree. Before we do that, however, let's modify the title lines since they are way too long:

>gi|15384333|gb|AF411019.1| Achromobacter xylosoxidans strain AU0665 16S ribosomal RNA gene, partial sequence

Since I don't know how to do that with an Alignment (or even a Sequence Collection) object yet, we'll modify the original file of downloaded sequences and repeat the alignment.

fn = 'temp.fasta'

def short(s):
stuff,genus,species = s.split()[:3]
gb = stuff.split('|')[3].split('.')[0]
return '_'.join((gb,genus[0],species))

def modifyNames(fn):
FH = open(fn,'r')
data =
L = data.split('\n\n')
pL = list()
for e in L:
title,seq = e.strip().split('\n',1)
title = short(title)
pL.append('>' + title + '\n' + seq)
FH = open(fn,'w')


The next step is to do the alignment. (I showed this code in the other post). Given the alignment, we make the tree like this:

from cogent import LoadSeqs, DNA
from cogent.evolve.models import GTR,HKY85
from cogent.phylo import distance, nj

def prelim():
fn = 'temp.aln.fasta'
seqs = LoadSeqs(fn,format='fasta')
model = HKY85()
dcalc = distance.EstimateDistances(seqs,model)
d = dcalc.getPairwiseDistances()
tree = nj.nj(d)
print '\n' + tree.asciiArt() + '\n'
print d.items()[0][0]
print d.items()[0][1]


The output is (the progress indicator is turned off):

| \-AJ278451_A_xylosoxidans
| | /-AJ002809_A_sp.
| \edge.0--|
-root----| \-EU373389_A_xylosoxidans

('AF411019_A_xylosoxidans', 'DQ450530_A_bacterium')

The last two lines give the first of the pairwise distances. This is a dictionary keyed by a tuple of the the two sequence titles. We'll explore using PyCogent to draw the tree next time. Having written it to disk, we can use our old friend R:

t = read.tree('temp.tree')

The result is shown at the top of the post. There is a slight problem that I haven't bothered to fix. The sequence titles saved in the Tree are quoted, and R prints them as is.

Wednesday, December 9, 2009

PyCogent 11: start w/multiple sequences

I'm exploring PyCogent (docs, download page)---first post here.

Next up are multiple sequences. In the first part we grab some rRNA sequences from NCBI.

import sys
from cogent.db.ncbi import EFetch

L = ['AF411020.1','EU373389.1',
fn = 'temp.fasta'

def test1():
e = EFetch(db='nucleotide',
s =
#print s
FH = open(fn,'w')


In the second part, we use muscle to align them (like before):

from cogent import LoadSeqs, DNA
from import align_unaligned_seqs as malign

def test2():
seqs = LoadSeqs(filename=fn,
print type(seqs)
aln = malign(seqs,DNA)


<class 'cogent.core.alignment.SequenceCollection'>

In the third part we just load the aligned sequences back into Python.

def test3():
fn = 'temp.aln.fasta'
sc = LoadSeqs(fn,format='fasta')
print sc[:40]
print type(sc)


>gi|21436540|emb|AJ278451.1| Achromobacter xylosoxidans subsp. denitrificans partial 16S rRNA gene, strain DSM 30026 (T)
>gi|15384333|gb|AF411019.1| Achromobacter xylosoxidans strain AU0665 16S ribosomal RNA gene, partial sequence
>gi|2832590|emb|AJ002809.1| Alcaligenes sp. 16S rRNA gene, isolate R6
>gi|92919431|gb|DQ450530.1| Alcaligenaceae bacterium LBM 16S ribosomal RNA gene, partial sequence
>gi|171191189|gb|EU373389.1| Achromobacter xylosoxidans strain TPL14 16S ribosomal RNA gene, partial sequence
>gi|15384334|gb|AF411020.1| Achromobacter xylosoxidans strain AU1011 16S ribosomal RNA gene, partial sequence

<class 'cogent.core.alignment.Alignment'>

Notice that when we load unaligned sequences, we have a cogent.core.alignment.SequenceCollection, while an aligned set is a cogent.core.alignment.Alignment. We'll use these going forward. And yes, those are mighty long title lines. And we can do slices on the alignment.

PyCogent 10: basic DNA methods

I'm exploring PyCogent (docs, download page)---first post here.

We'll use this method from the last post (and you'll also need to load the data from your favorite FASTA-formatted sequence file), as shown in that example.

from cogent import Sequence, DNA

# method 1: from a string
seq = Sequence(moltype=DNA,
print '1', seq[:24]

print type(seq)
print seq[:36].toFasta()

<class 'cogent.core.sequence.DnaSequence'>

print seq.Name
print seq[:36]
print seq[:36].complement()
print seq.MW() # this strand only


print seq[:36].rc()  # reverse complement
print seq.canPair(seq.rc())


print seq[-36:].toRna()
print seq.hasTerminalStop()


s = seq.withoutTerminalStopCodon()
print s[-33:]
print s.getTranslation()[-11:]
print s.getOrfPositions()

[(0, 1254)]

s = seq.toFasta().split('\n',1)[1]
print s[:36]
print type(s)

<type 'str'>

mseq = seq.shuffle()
print seq[:36]
print mseq[:36]
print seq.count('g') == mseq.count('g')
print all([seq.count(n) == mseq.count(n) for n in 'acgt'])


PyCogent 9: getting serious with Sequence

I'm exploring PyCogent (docs, download page)---first post here.

The most central and basic object type in PyCogent is Sequence. There are several ways to construct a DNA Sequence object. For the demo, I need to load a FASTA-formatted sequence, and recover the title and DNA sequence separately. (Substitute your own to follow along).

from cogent import LoadSeqs, DNA, Sequence

def loadData(fn):
FH = open(fn, 'r')
data =
return data

fn = 'cogent/SThemA.fasta'
data = loadData(fn)
n,s = data.strip().split('\n',1)
n = n[1:]

Now, here are five ways to create a DNA Sequence object:

# method 1:  from a string
seq = Sequence(moltype=DNA,
print '1', seq[:24]

# method 2: from a text file
tfn = 'cogent/SThemA.txt'
seq = Sequence(moltype=DNA,
print '2', seq[:24]

# method 3: from a file
# guess format from extension
seq = Sequence(moltype=DNA,
print '3', seq[:24]

# method 4: a method in DNA
seq = DNA.makeSequence(s)
seq.Name = n
print '4', seq[:24]

# method 5: by joining two DNA sequences
a = DNA.makeSequence("AGTACACTGGT")
seq2 = seq[:6] + a
print '5', seq2

# method 6: from an NCBI record (via str)
from cogent.db.ncbi import EFetch
e = EFetch(db='nucleotide',
data =
n,s = data.strip().split('\n',1)
# go to method 1
seq = Sequence(moltype=DNA,
print '6', seq[:24]



Here is one more that is a little trickier:

# method 7:  using a parser
from cogent.parse.fasta import FastaParser

# must modify for lowercase
g = FastaParser(f)
n,s =
s = str(s).upper()
# go to method 1
seq = Sequence(moltype=DNA,
print '7', seq[:24]

# simpler use is uppercase sequence
fn2 = '.temp'
FH = open(fn2,'w')
FH.write('>' + n + '\n' + s.upper())

g = FastaParser(f)
n,s =
# go to method 1
seq = Sequence(moltype=DNA,
print '8', seq[:24]


We'll look at the most basic methods of DNA Sequence objects next.

elementary $PATH stuff

As I mentioned the other day, it seems like PyCogent likes to use the $PATH variable rather than Python's sys.path to look for executables like clustalw and muscle. While I was updating my clustalw (actually clustalw2) installation at work to version 2.0.12, I worked on this quite elementary BASH shell issue. (I got clustalw from EMBL-EBI, see FTP download link here).

There is a directory in my home directory ~/bin where my bioinformatics binaries live. I copied the new folder clustalw-2.0.12-macosx from the download into there, as usual. Then, I just made a symbolic link to the executable like so:

ln -s ~/bin/clustalw-2.0.12-macosx/clustalw2 ~/bin/clustalw

I used clustalw for the link since that's what Py-Cogent expects. I tested it by doing ~/bin/clustalw from the Desktop. Now I needed to put ~/bin on my PATH. You can look at the $PATH variable as follows:

echo $PATH

output before the modification:


To modify it, I made a text file called on my Desktop containing this line:

PATH=$PATH:$HOME/bin/ ;export PATH

I put it in the right place by doing:

cp ~/Desktop/x.txt ~/.bash_profile

and tested by first doing

echo $PATH

which gives /Users/te/bin/ appended to the previous output. The final test (for now, anyway) is to do clustalw from the Desktop.

$ clustalw

******** CLUSTAL 2.0.12 Multiple Sequence Alignments ********

1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees

S. Execute a system command
X. EXIT (leave program)

Tuesday, December 8, 2009

PyCogent 8: getting matplotlib

I'm exploring PyCogent (docs, download page)---first post here.

As I mentioned the other day, I was having a little trouble with the drawing code for case study #2 from the paper (PMID 17708774, link). (Note that this part of the example is not in the text, but in supplementary file #4, link).

Although PyCogent initially used ReportLab for at least some drawing, I'm told that they have switched over completely to matplotlib. It's a good thing I didn't know that at first. I've struggled with matplotlib in the past. They have a special page about installing on OS X (the "it sucks to be you" kind of page), where they strongly recommend that you install a separate Python.

The senior author on the PyCogent paper is Gavin Huttley, whose group is at the John Curtin School of Medical Reearch at ANU in Canberra, Australia. I posted to the PyCogent help forum at sourceforge, and Gavin very patiently walked me through the solution to the issues I encountered. Most important, he worked up a wiki page explaining in detail how to build and install matplotlib for OS X Snow Leopard.

Another problem that slowed me down was that I had installed PyCogent first with pip, and subsequently with subversion, and somehow screwed things up. I ripped PyCogent out and rebuilt it and it works fine now. See the help thread for details.

Now, in the directory holding all the files I got from the supplementary data for the paper, I do:

cd src

where the script has been modified slightly as described in the thread. A pdf is written to the figs directory. Here is a screenshot of the upper-left hand corner.

As described in the paper:
We display low G+C% to high G+C% on a spectrum from yellow to blue...We note that, in general, closely related taxa exhibit a similar color. However, certain lineages appear to have evolved to low G+C%, quickly raising questions about environmental and/or changes in DNA metabolism that may distinguish these organisms from their sister taxa. Another feature apparent from this tree is a general trend for earlier diverging lineages to be intermediate in G+C%, and for sud- den changes toward low G+C% to be more common than sud- den changes to high G+C% (there are more yellow branches surrounded by blue or green neighbors than blue branches surrounded by yellow or green neighbors).

I don't get much out of knowing the identity of the organism that has gone to low GC recently. But here is a paper (PMID 19001264) from Dan Andersson's group that may be relevant (ultimately) as to mechanism.

Monday, December 7, 2009

Python package installations

This post is really for me, in the sense of "letting google organize my head." I'm getting into PyCogent (first post here). They're switching over to using matplotlib, which has complicated installation instructions for OS X. This has always scared me away before. But, I want PyCogent to work. Before I get into that, I need to re-learn what Python installation tools are about.

I'm sure any Pythonista knows all this already, but here goes...

What is easy_install?

Easy Install comes from PEAK (Python Enterprise Application Kit). The docs are here. How did it get on my Snow Leopard system? According to the PyCogent docs front page, it comes with the stock Python install on Snow Leopard. I couldn't find it by searching the system. My site-packages directory /Library/Python/2.6/site-packages has a file named easy-install.pth but that's it. easy_install must be somewhere because this works:

easy_install --help

In particular the output of that command includes:

--upgrade (-U)
force upgrade (searches PyPI for latest versions)

usage: easy_install [options] requirement_or_url ...

According to the docs easy_install is related to setuptools (see below), like a "wrapper" for them. According to the Pylons docs, it is actually part of the setuptools module. What about that easy-install.pth file?

According to the same page:
Easy Install works by maintaining a file in your Python installation’s lib/site-packages directory called easy_install.pth. Python looks in .pth files when it is starting up and adds any module ZIP files or paths specified to its internal sys.path. Easy Install simply maintains the easy_install.pth file so that Python can find the packages you have installed with Easy Install.

The first four lines of my easy-install.pth are:

import sys; sys.__plen = len(sys.path)

Now, I think numpy-1.3 is stock on Snow Leopard. The time stamp on the referenced file is Friday, September 4, 2009 8:38 AM. I think that is the date of my upgrade to Snow Leopard, so it checks. easy_install probably comes with Snow Leopard and is used to do the numpy install.

What is setuptools?

Python code to "build, install, upgrade and uninstall Python package - easily!" It comes from here. It is what we're using when we do:

python build
sudo python install

So what is PyPi? It's the Python Package Index, "a repository of software for the Python programming language…currently 8369 packages." It either is or used to be named CheeseShop (a Monty Python thing). According to this tutorial, easy_install "gives you a quick and painless way to install packages remotely." Ah... now I get it.

What is pip?

The PyCogent Quick Installation instructions mention this. According to PyPi pip is intended as a replacement for easy_install. Among other things, pip has explicit requirement files. For example, cogent-requirements.txt:


Note that Apple has provided a special place for /site-packages to live. It is not where you might expect:

(although there is a lot there) but rather here:
You'll can see this in sys.path.