## Sunday, March 27, 2011

### Equation of the ellipse

Here is a simple derivation of the equation for an ellipse. It is taken from Morris Kline's book, Calculus: An Intuitive and Physical Approach.

We place the two foci of an ellipse (F and F') at the coordinates `(c,0)` and `(-c,0)`. Each point on the ellipse is defined by the property that the sum of the distances to F and to F' is constant, which we define as equal to `2a`. Our old friend Pythagoras helps us find the distances in terms of `x,y` and `c`.

```
PF  = √[y² + (x-c)²]
PF' = √[y² + (x+c)²]

2a = PF + PF'
PF = 2a - PF'

√[y² + (x-c)²] = 2a - √[y² + (x+c)²]
```

Square both sides (and expand):

```
y² + x² - 2xc + c² = 4a² - 4a √[y² + (x+c)²] + y² + x² + 2xc + c²
```

Cancel terms (`x2`, `y2`, and `c2`) and rearrange to isolate the remaining square root:

```
-2xc = 4a² - 4a √[y² + (x+c)²] + 2xc

4a √[y² + (x+c)²] = 4a² + 4xc
 a √[y² + (x+c)²] = a² + xc
```

Square both sides again and expand:

```
a² (y² + x² + 2xc + c²) = a⁴ + 2a²xc + x²c²

a²y² + a²x² + a²c² = a⁴ + x²c²

a²x² - x²c² + a²y² = a⁴ - a²c²
```

Factor out `a2 - c2`:

```
x² (a² - c²) + a²y² = a² (a² - c²)
```

Define `b² = a² - c²`:

```
b²x² + a²y² = a²b²
```

Divide to obtain the familiar form:

```
x²/a² + y²/b² = 1
```

Note that when

```
x = 0, y = ±b
y = 0, x = ±a
```

The squaring method allows the possibility that the simplified equation has solutions that are not valid for the original version. This turns out not to be the case, and Kline deals with this issue in the book.

Also, we might note that since

`b² = a² - c²`

if `a` and `b` are fixed, then `c` is determined. For the figure shown here, `b = c` and hence `a = √2 b`.
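As a quick numeric check of the derivation (my own sketch, not from Kline), we can take the case `b = c` from the figure, so `a = √2 b`, and confirm that for points on the ellipse the sum of the focal distances really comes out to `2a`:

```python
from math import sqrt

# Check: for points on x²/a² + y²/b² = 1, the sum of distances
# to the foci at (c, 0) and (-c, 0) should equal 2a.
def focal_sum(x, a, b, c):
    y = b * sqrt(1 - x**2 / a**2)      # point on the upper half
    PF  = sqrt(y**2 + (x - c)**2)      # distance to F  at ( c, 0)
    PF_ = sqrt(y**2 + (x + c)**2)      # distance to F' at (-c, 0)
    return PF + PF_

b = c = 1.0
a = sqrt(2) * b                        # the b = c case from the figure
for x in [0.0, 0.5, 1.0, a]:
    assert abs(focal_sum(x, a, b, c) - 2*a) < 1e-9
```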

UPDATE: A virtually identical proof can be found in Wikipedia (here)---should have looked first.

### Flags, detail

A quick note about the flags. There are enough now that it's hard to be sure I got them all. So... I went to the Flag Counter site (this page, and the next). Rather than do anything fancy, I just copied the text to a file and then processed it with Python. My database of flag images is from Wikipedia, and I shortened the file names to the two-letter country code. Since I can't remember a number of them, I wrote a Python script to harvest the Flag Counter entries and match them with the country codes (from here).

I checked the directory with the flag images by eye, which is almost certainly a mistake.

Here is the list of countries from which visitors to this site have come, in alphabetical order. The script is at the end. It shows a number of typical issues you run into with this kind of processing.

```
AE United Arab Emirates
AL Albania
AN Netherlands Antilles
AR Argentina
AT Austria
AU Australia
BB Barbados
BE Belgium
BG Bulgaria
BH Bahrain
BR Brazil
BY Belarus
CA Canada
CH Switzerland
CL Chile
CN China
CO Colombia
CR Costa Rica
CS Serbia
CV Cape Verde
CY Cyprus
CZ Czech Republic
DE Germany
DK Denmark
DZ Algeria
EC Ecuador
EE Estonia
EG Egypt
ES Spain
FI Finland
FR France
GH Ghana
GR Greece
HK Hong Kong
HR Croatia
HU Hungary
ID Indonesia
IE Ireland
IL Israel
IN India
IS Iceland
IT Italy
JM Jamaica
JP Japan
KR South Korea
LT Lithuania
LU Luxembourg
MA Morocco
MD Moldova
MT Malta
MU Mauritius
MX Mexico
MY Malaysia
NL Netherlands
NO Norway
NZ New Zealand
PA Panama
PE Peru
PH Philippines
PK Pakistan
PL Poland
PR Puerto Rico
PT Portugal
QA Qatar
RO Romania
RU Russia
SA Saudi Arabia
SE Sweden
SG Singapore
SI Slovenia
SK Slovakia
SV El Salvador
TH Thailand
TN Tunisia
TR Turkey
TT Trinidad and Tobago
TW Taiwan
UA Ukraine
UK United Kingdom
US United States
UY Uruguay
VE Venezuela
VN Vietnam
ZA South Africa
```

```
from utils import load_data

specials = { 'South_Korea':'Korea_(South)',
             'Russia':'Russian_Federation',
             'New_Zealand':'New_Zealand_(Aotearoa)',
             'Serbia':'Serbia_and_Montenegro',
             'Croatia':'Croatia_(Hrvatska)',
             'Vietnam':'Viet_Nam' }

data = load_data('country-codes.txt')
D = dict()
for line in data.strip().split('\n'):
    L = line.strip().split()
    D['_'.join(L[1:])] = L[0]

cL = list()
data = load_data('scraped.txt')
for line in data.strip().split('\n'):
    L = line.strip().split()
    i = len(L) - 4
    country = '_'.join(L[1:i])
    cL.append(country)
    if country in specials:
        k = specials[country]
        D[country] = D[k]

def f(k):
    return D[k]

for country in sorted(cL, key=f):
    print D[country], '\t', country.replace('_',' ')
```

### Flag update

We continue to accumulate unique visitors from new countries. Here are the flags of 16 more.

## Friday, March 18, 2011

### nothing in biology makes sense except in the light of evolution

I've been reading Mike Yarus's book, Life from an RNA World (Amazon). It's a very readable account of evolution from the perspective of sequences and the RNA world. Mike is a highly intelligent guy, and his intelligence and wit inform every page. In one chapter, he gets Rob Knight and Steve Freeland to do an evolution simulation.

we let a computer write out a random string, mutate 1 in 100 characters in each generation, and select changes only if they match Dobzhansky

In my version:
• in each generation we pick one position
• if it already matches Dobzhansky, continue to the next generation
• mutate to a random choice from the set of symbols

This isn't a very good model of evolution. It was just fun to spend a few minutes coding it.

Here is the beginning, middle and end of one run:

```
> python evolve.py
    0 LVSTUknwIfGIRHFgHESZ M zthWtNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
  100 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
  200 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidhWEZO ZwNjDyCvq
  300 LVSTiknwIfGIRHFgHESZ M zthetNQ k qgtoeMvjeJAzOidhWEZO ZwNjuyivq
 6100 notTing if biHlogy makes sense except iv the Oight of Zvolution
 6200 notTing if biHlogy makes sense except iv the Oight of Zvolution
 6300 nothing if biHlogy makes sense except iv the Oight of Zvolution
 6400 nothing if biHlogy makes sense except iv the Oight of Zvolution
13800 nothing in biHlogy makes sense except in the light of evolution
13900 nothing in biHlogy makes sense except in the light of evolution
14000 nothing in biHlogy makes sense except in the light of evolution
14100 nothing in biHlogy makes sense except in the light of evolution
14167 nothing in biology makes sense except in the light of evolution
14167 nothing in biology makes sense except in the light of evolution
```
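The final count is in the ballpark of a back-of-the-envelope estimate (my own, not from the book). The script picks one of the 63 positions per generation and, at a wrong position, accepts the mutation only if it hits the right character out of the 54 in the `symbols` string; a coupon-collector argument then predicts roughly `N * 54 * H_N` generations, where `H_N` is the Nth harmonic number:

```python
# Coupon-collector estimate of the expected run length:
# each wrong position is corrected with probability (1/63)*(1/54)
# per generation, and we wait for the slowest of the 63 positions.
N, S = 63, 54
H = sum(1.0 / k for k in range(1, N + 1))   # harmonic number H_63
expected = N * S * H
print(int(expected))   # roughly 16000, the same ballpark as the observed 14167
```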

```
import random

s = 'nothing in biology makes sense'
s += ' except in the light of evolution'
N = len(s)
symbols = 'abcdefghijklmnopqrstuvwxyz '
symbols += symbols.upper()

L = [random.choice(symbols) for i in range(N)]
print ' 0 ' + ''.join(L)

i = 0
while L != list(s):
    i += 1
    v = i and not i % 100
    if v:
        print str(i).rjust(5),
    j = random.choice(range(N))
    if L[j] == s[j]:
        if v:
            print ''.join(L)
        continue
    c = random.choice(symbols)
    if c == s[j]:
        L[j] = c
    if v:
        print ''.join(L)

print
print str(i).rjust(5), s
print str(i).rjust(5), ''.join(L)
```
### Mutual Information (3)

We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.

Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.

```
 90 4.102
260 4.025
122 3.993
209 3.976
234 3.944
101 3.939
235 3.938
152 3.935
 63 3.912
 22 3.887

159 0.191
250 0.138
132 0.133
291 0.086
199 0.067
242 0.052
 16 0.012
106 0.012
156 0.012
158 0.012
```

As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.

We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.

In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.

I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.

Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:

```
>26250004-26250005/1-457
KQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSAESINKDIEECNAIIEQFIDYLRTGMA
DLNAVLGEVIA--AESGYEREIETAL-YVKMHPLSIKRAVANMVVNAARY-GNGWIKVSS
GTEAWFQVEDDGPGIAPEQRKHLFQPFVRGDISGTGLGLAIVQRIVDNHNGMLELGTSER
GGLSIRAWLPNYKILVVDDDMRLRALLERYLTEQGFQVRSVANAEQMDRLLTRESFHLMV
LDLMLPGEDGLSICRRLRSQSPMPIIMVTAKGEEVDRIVGLEIGADDYIPKPFNPRELLA
RIRAVLRR
```

I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,

EnvZ (the HK) sequence starts with:

```
AGVKQLADDRTLLMAGVSHDL
   KQLADDRTLLMAGVSHDL
```

The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.

OmpR (the RR) sequence starts with

```
MQENYKILVVDDDMRLRALLER
   NYKILVVDDDMRLRALLER
```

The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.
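The two offsets can be wrapped in a tiny helper (my own sketch; `to_protein_residue` is a made-up name, and it assumes the HK occupies alignment columns 0-189 and the RR columns 190-307):

```python
# Map a 0-based alignment column to a 1-based protein residue number,
# using the offsets worked out above.
def to_protein_residue(i):
    if i >= 190:
        return ('RR', i - 187)   # OmpR residue number
    return ('HK', i + 15)        # EnvZ residue number

print(to_protein_residue(0))     # ('HK', 15): the K is EnvZ residue 15
print(to_protein_residue(190))   # ('RR', 3):  the N is OmpR residue 3
```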

Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).

[Note: to keep the code simple, I ignored the situation where a column contains '-' (a gap in the alignment) for some of the sequences. That's what gives the two strange results below, at rows 16 and 20 of the list.]

```
> python plot.py
 14 RLRAL  42 ATEMM 0.822
 18 LLERY  42 ATEMM 0.816
 22 YLTEQ  42 ATEMM 0.769
 15 LRALL  37 TRIRL 0.730
 15 LRALL  42 ATEMM 0.694
 56 MLPGE  54 DIEEC 0.688
 14 RLRAL  54 DIEEC 0.678
 21 RYLTE  42 ATEMM 0.677
 14 RLRAL  38 RIRLA 0.668
 83 KGEEV  22 TLLMA 0.667
 22 YLTEQ  21 RTLLM 0.663
  4 YKILV  42 ATEMM 0.658
 83 KGEEV  54 DIEEC 0.657
 22 YLTEQ  18 ADDRT 0.650
 22 YLTEQ  54 DIEEC 0.646
 18 LLERY  86 --AES 0.644
 15 LRALL  54 DIEEC 0.643
 18 LLERY  38 RIRLA 0.636
 22 YLTEQ  45 MMSAE 0.633
 22 YLTEQ  86 --AES 0.631
```

The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.

[ UPDATE: The heatmap looks pretty boring, so I'm going to skip it. But I plotted the top 20 interactions (graphic at the top of the post); the repetition indicates we're on the right track. ]

```
import sys
from utils import load_data
import matplotlib.pyplot as plt

data = load_data('results.2.txt')
data = data.strip().split('\n')
data = [e.split() for e in data]
data = [(int(t[0]), int(t[1]), float(t[2])) for t in data]

def f(t):
    return float(t[2])

def part1():
    L = [t for t in data if t[0] == t[1]]
    L = sorted(L, key=f, reverse=True)
    for t in L[:10]:
        print str(t[0]+1).rjust(3), round(t[2],3)
    print
    for t in L[-10:]:
        print str(t[0]+1).rjust(3), round(t[2],3)
    sys.exit()

#part1()

L = [t[2] for t in data if t[0] != t[1]]
X = 1.0
plt.hist(L, bins=X*50)
ax = plt.axes()
ax.set_xlim(0,X)
plt.savefig('example.png')

# t[0] always > t[1]
N = 190
data = [t for t in data if t[0] >= N and t[1] < N]

aln = load_data('cell3925mmc4.fa')
aln = aln.strip().split('>')[1:]
aln = [e for e in aln if e.startswith('26250004-26250005/1-457')]
envZ_ompR = aln[0].split('\n')[1]

for t in sorted(data, key=f, reverse=True)[:20]:
    i = t[0]
    rr = i - 187
    j = t[1]
    hk = j + 15
    print str(rr).rjust(3), envZ_ompR[i:i+5],
    print str(hk).rjust(3), envZ_ompR[j:j+5],
    print '%3.3f' % t[2]
```

### Mutual information (2)

We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). The first post is here.

In this part we'll load the alignment (supplementary data file S4---the annotation on the page is incorrect), and crunch the numbers. I just write the results to disk.

We'll do the analysis in another post.

`python info.py > results.txt`

```
import sys
from utils import load_data
import info_helper as ih

#aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'

fn = 'cell3925mmc4.fa'
data = load_data(fn)
data = data.strip().split('>')[1:]
data = [e.split('\n')[1].strip() for e in data]

def show(data):
    print 'starting:', len(data)
    for i in range(7):
        print i,
        L = [e for e in data if e.count('-') <= i]
        print len(L)
    sys.exit()

#show(data)

def transpose(L):
    R = range(len(L[0]))
    rL = list()
    for i in R:
        rL.append(''.join([item[i] for item in L]))
    return rL

data = [e for e in data if e.count('-') <= 4]
#data = data[:100]
cols = transpose(data)
pD = ih.make_prob_dict(cols)

info = dict()
for i in range(len(cols)):
    for j in range(i+1):
        info[(i,j)] = ih.get_info(i,j,cols,pD,v=False)

for i,j in sorted(info.keys()):
    print i, j, round(info[(i,j)],3)
```

## Thursday, March 17, 2011

### Fun with geometry (1)

I found a couple of fun books of problems in geometry, algebra, and probability (geometry book here).

This is one of the problems: given the red circles with radius one-half the large black circle, and the blue circle inscribed so as to just fit inside, derive a relation between the radius of the blue circle and the others. This had me scratching my head for a few minutes before the aha moment.

The challenge question is perhaps easier: prove that the filled-in gray area is equal to the area of one of the red circles.

And a hint for the first problem comes from the next graphic, where I've made a copy of the blue circle and positioned it strategically:

## Tuesday, March 15, 2011

### Mutual information

I want to talk about a really nice paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). It'll give us an opportunity to exercise our matplotlib skills.

We're going to try to recreate Fig 1, which is visible in the PubMed page, or you can get the original paper from the link to Cell.

Two-component signal transduction systems are ubiquitous in bacteria (wikipedia). The canonical design consists of a membrane-bound sensor (histidine) kinase (HK) and a cytoplasmic response regulator (RR). E. coli contains about 30 such pairs. The members of each pair have substantial specificity. The HK of the ntr system has specificity for its own RR, and likewise in the phoBR system, phoB has specificity for phoR. We may speak of a HK and its cognate RR.

For our purposes the important thing is that each system comprises (in the simplest design) two protein partners with complementary surfaces. These systems (a pair of proteins) are the products of ancient gene duplication events, and have since diverged over time. Amino acids at interacting sites are constrained to co-evolve in each pair.

If this sounds too vague or too complicated, consider an even simpler example: a stem of paired RNA residues in rRNA named H15.

Here is the H15 sequence in 1D (the parentheses indicate residues involved in pairing---see the link above for details):

```
(((( ((((( ))))) ))))
TGCACAATGGGCGCAAGCCTGATGCA
```

And here is the inner stem drawn in 2D to show the base-pairing more directly:

```
TGGGC
GTCCG
```

The base-pairing of this stem is more important to rRNA function than the identities of the bases. The result is that in some bacteria the identities of the central bases have been switched:

```
original    co-evolved

TGGGC       TGCGC
GTCCG       GTGCG
```

Presumably this happened in 2 discrete steps, but I don't know of any examples where the intermediate state has been preserved. Maybe we should look for some, and it's undoubtedly been studied.

To quantify this kind of coevolution, we'll draw on a concept (and mathematical definition) called mutual information. The steps in the calculation will be:

• make a multiple sequence alignment
• compare column X and column Y
• total number of sequences (length of each column) = c

• for each residue x in column X calculate px
• for each residue y in column Y calculate py
• px is the probability of residue x in column X

We'll write the columns horizontally to save space.
Suppose columns X and Y are:

```
X: AASSASSTTT
Y: NMWWNTTKKS
```

For column X we have:
```
pA = 0.3 (3 A out of a total of 10 residues)
pS = 0.4
pT = 0.3
```
For column Y:
```
pK = 0.2
pM = 0.1
pN = 0.2
pS = 0.1
pT = 0.2
pW = 0.2
```
We pre-calculate these values for each column. When we calculate the information, we'll refer to the probabilities for column Y as q rather than p, to keep them straight from the p's for column X.

Now, we consider each pair of residues, one from column X and one from column Y. This pair is made up of residues in two interacting protein surfaces or rRNA chains, that may have co-evolved.

• pxy is the number of sequences with x in column X and y in column Y
• divided by c, the total number of pairs:

```
X: AASSASSTTT
Y: NMWWNTTKKS
```

```
pAM = 0.1
pAN = 0.2
pST = 0.2
pSW = 0.2
pTK = 0.2
pTS = 0.1
```
Finally, to compute the mutual information for this pair of columns, we do this calculation for each individual pair of residues and then sum:

`pAM * log(pAM / (pA * qM)) = 0.1 * log(0.1 / (0.3 * 0.1)) = 0.0523`
I would have used log2, but Skerker et al used log10, so I matched them.
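The whole calculation for this pair of columns can be checked with a compact, self-contained snippet (a sketch in modern Python; the full helper module used for the real data is at the end of the post):

```python
from math import log10

X = 'AASSASSTTT'
Y = 'NMWWNTTKKS'
c = len(X)

# per-column residue probabilities (the p's and q's above)
p = {ch: X.count(ch) / c for ch in set(X)}
q = {ch: Y.count(ch) / c for ch in set(Y)}

# joint probabilities for each residue pair, then sum the terms
pairs = [x + y for x, y in zip(X, Y)]
info = sum(pairs.count(xy) / c *
           log10((pairs.count(xy) / c) / (p[xy[0]] * q[xy[1]]))
           for xy in set(pairs))

print(round(info, 2))   # 0.47
```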

Here is part of the output of the script below:

```
NMWWNTTKKS
AASSASSTTT
pKT 0.2 pK 0.2 qT 0.3 temp 3.33 final 0.1
pMA 0.1 pM 0.1 qA 0.3 temp 3.33 final 0.05
pNA 0.2 pN 0.2 qA 0.3 temp 3.33 final 0.1
pST 0.1 pS 0.1 qT 0.3 temp 3.33 final 0.05
pTS 0.2 pT 0.2 qS 0.4 temp 2.50 final 0.08
pWS 0.2 pW 0.2 qS 0.4 temp 2.50 final 0.08
info 0.47
```

temp is the result of the calculation inside the parentheses, above. Next time we'll apply this method to the data from Skerker.

```
from math import log

def log2(n):
    return log(n)*1.0/log(2)

def log10(n):
    return log(n)*1.0/log(10)

# cache character probabilities for each column
def make_prob_dict(cols):
    # input is a list of columns
    # c is the number of sequences in the alignment
    c = len(cols[0])
    pD = list()
    for col in cols:
        char_kinds = list(set(col))
        values = [col.count(k)*1.0/c for k in char_kinds]
        pD.append(dict(zip(char_kinds,values)))
    return pD

def get_info(i,j,cols,pD,v=False):
    col1, col2 = cols[i], cols[j]
    if v:
        print col1 + '\n' + col2
    # as before, c is the number of sequences
    c = len(col1)
    assert c == len(col2)
    info = 0
    pairs = [col1[k] + col2[k] for k in range(c)]
    pL = sorted(list(set(pairs)))
    for p in pL:
        pXY = pairs.count(p)*1.0/c
        pX = pD[i][p[0]]
        pY = pD[j][p[1]]
        inside = (pXY * 1.0) / (pX * pY)
        if v:  print 'p' + p, pXY,
        if v:  print 'p' + p[0], pX,
        if v:  print 'q' + p[1], pY,
        if v:  print 'temp', '%3.2f' % round(inside, 2),
        outside = pXY * log10(inside)
        if v:  print 'final', round(outside,2)
        info += outside
    if v:
        print 'info', round(info,2)
    return info

if __name__ == '__main__':
    aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'
    cols = aln.split('\n')
    pD = make_prob_dict(cols)
    info = dict()
    for i in range(len(cols)):
        for j in range(i):
            info[(i,j)] = get_info(i,j,cols,pD,v=True)
```
## Sunday, March 13, 2011

### Cocoa: where to start?

Recommended resources for beginning Cocoa with Objective-C

• do the temperature converter in the Cocoa Application Tutorial
• short articles at Cocoa Dev (here; much more here; C review)
• Aaron Hillegass's book
• Cocoa Fundamentals Guide (here)

Specific to PyObjC:

• Will Larson's tutorials (here here here here here)
• Apple's page including a version of the temperature converter
• Read the official introduction carefully (my biggest problem)

A page of links to old material, with simple demos of specific Cocoa features that mostly still work, is here.

Code a simple game like TicTacToe, Fifteen, or Color Sudoku.

After that, I've got tons of projects here and here. Get started with bindings (here), then move on to Vlad the Impaler (here).

Learn specific topics by reading the Apple docs (slowly and repeatedly; it can take a while to get it) and by building a simple demo project that does only that one thing. Like NSPredicate, which we've done in six posts (here here here here here & one to come).

And if you want the PyObjC templates for Xcode see here.
### NSPredicate: PyObjC version

Here is the answer in the previous post, converted to PyObjC. It's a bit simpler, except that some of the methods require a "real" NSArray or NSString.

```
from Foundation import *
import objc

class NSString(objc.Category(NSString)):
    def validate(self):
        A = self.UTF8String().split(':')
        fm = NSFileManager.defaultManager()
        def f(s):
            print 'test', s
            home = NSHomeDirectory()
            s = s.replace('~',home)
            return fm.fileExistsAtPath_(s)
        return all([f(s) for s in A])

s = NSString.stringWithString_('~/Desktop')
print s.validate()

for s in ['~:~/Desktop','xyz']:
    s = NSString.stringWithString_(s)
    f = NSExpression.expressionForConstantValue_(s)
    e = NSExpression.expressionForFunction_selectorName_arguments_(
        f,'validate',None)
    results = e.expressionValueWithObject_context_(None,None)
    print results
print

p = NSPredicate.predicateWithFormat_(
    "FUNCTION(SELF, 'validate') isEqual:YES")
A = ['~/Desktop','xyz']
A = [NSString.stringWithString_(s) for s in A]
A = NSArray.arrayWithArray_(A)
for item in A.filteredArrayUsingPredicate_(p):
    print item
print p.evaluateWithObject_(NSString.stringWithString_('~/Desktop'))
```

output:

```
test ~/Desktop
True
test ~
test ~/Desktop
True
test xyz
False
test ~/Desktop
test xyz
~/Desktop
test ~/Desktop
True
```

### NSPredicate: problem solved (almost)

It only took me about four hours, but with big help from this post, I solved the problem of filtering a string containing an array of (possibly invalid) filepaths. And, in the meantime, Dave DeLong himself had responded to my question on StackOverflow from yesterday, which, if I'd just waited a bit more, would have saved me some time. Thanks, Dave.

As I mentioned (here), we're just wrapping a filtering routine up in a category on NSString. Here is the first part of the code file including the category:

```
#import <Foundation/Foundation.h>

@interface NSString (ValidatingFilepathArray)
- (NSNumber *) validate;
@end

@implementation NSString (ValidatingFilepathArray)
- (NSNumber *) validate {
    NSArray *A = [self componentsSeparatedByString:@":"];
    NSString *s, *p;
    NSFileManager *fm = [NSFileManager defaultManager];
    for (p in A) {
        s = [p stringByExpandingTildeInPath];
        if ([fm fileExistsAtPath:s]) {
            NSLog(@"passed: %@", p);
        }
        else {
            NSLog(@"failed: %@", p);
            return [NSNumber numberWithBool:NO];
        }
    }
    return [NSNumber numberWithBool:YES];
}
@end
```

In the code at the bottom of the post, we construct a "function expression" from the new NSString method. Two things really confused me. First, in the method

 `expressionForFunction:selectorName:arguments:`

the "function" is the object itself. And the second thing was figuring out the format string for the predicate that wraps up our expression:

 `@"FUNCTION(SELF, 'validate') isEqual:YES"`

This is the output:

```
> gcc -o test test.m -framework Foundation
> ./test
2011-03-13 16:53:42.974 test[3494:903] passed: ~
2011-03-13 16:53:42.976 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.977 test[3494:903] expression for ~:~/Desktop: YES
2011-03-13 16:53:42.978 test[3494:903] failed: xyz
2011-03-13 16:53:42.978 test[3494:903] expression for xyz: NO
-------------------------------------------------
2011-03-13 16:53:42.979 test[3494:903] passed: ~
2011-03-13 16:53:42.980 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.980 test[3494:903] failed: xyz
2011-03-13 16:53:42.981 test[3494:903] filtered: ~:~/Desktop
2011-03-13 16:53:42.981 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.982 test[3494:903] evaluate: YES
```

First we construct the expression and evaluate it. In the second part, we construct the predicate and use it to filter an array or just "evaluate" a string. Still to do: test it in an app with bindings and an NSTextField or NSTableView.

Code:

```
int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSString *validPathArray = @"~:~/Desktop";
    NSExpression *f = [NSExpression
        expressionForConstantValue:validPathArray];
    NSExpression *e = [NSExpression
        expressionForFunction:f
        selectorName:@"validate"
        arguments:nil];
    NSNumber *result = [e expressionValueWithObject:nil context:nil];
    NSArray *responses = [NSArray arrayWithObjects:@"NO",@"YES",nil];
    NSLog(@"expression for %@: %@", validPathArray,
        [responses objectAtIndex:[result intValue]]);

    NSString *invalidPathArray = @"xyz";
    f = [NSExpression expressionForConstantValue:invalidPathArray];
    e = [NSExpression expressionForFunction:f
        selectorName:@"validate"
        arguments:nil];
    result = [e expressionValueWithObject:nil context:nil];
    NSLog(@"expression for %@: %@", invalidPathArray,
        [responses objectAtIndex:[result intValue]]);

    printf("-------------------------------------------------\n");

    NSPredicate *p;
    p = [NSPredicate predicateWithFormat:
        @"FUNCTION(SELF, 'validate') isEqual:YES"];
    NSArray *A = [NSArray arrayWithObjects:
        validPathArray, invalidPathArray, nil];
    NSArray *fA = [A filteredArrayUsingPredicate:p];
    for (id obj in fA) {
        NSLog(@"filtered: %@", obj);
    }
    BOOL yesorno = [p evaluateWithObject:@"~/Desktop"];
    NSLog(@"evaluate: %@", [responses objectAtIndex:(int) yesorno]);
    [pool drain];
    return 0;
}
```

### NSExpression: simple examples

The example from last time (here) introduced the NSExpression class. That one was rather complex in code, though what it does is just a simple filtering of values.

Here is another example, which looks less forbidding: it combines two expressions in a single predicate. The expressions are used to construct a comparison predicate that grabs the object for key='value' from an array and checks it against the number 10.

As before the code block is inside a standard main:

```
#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    // code here..
    [pool drain];
    return 0;
}
```

and compiled like this:

 `> gcc -o test test.m -framework Foundation`

code:

```
NSExpression *lhs = [NSExpression expressionForKeyPath:@"value"];
NSNumber *ten = [NSNumber numberWithInt:10];
NSExpression *rhs = [NSExpression expressionForConstantValue:ten];
NSPredicate *p = [NSComparisonPredicate
    predicateWithLeftExpression:lhs
    rightExpression:rhs
    modifier:NSDirectPredicateModifier
    type:NSGreaterThanOrEqualToPredicateOperatorType
    options:0];
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:3] forKey:@"value"],
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:15] forKey:@"value"],
    nil];
NSArray *fA = [A filteredArrayUsingPredicate:p];
for (id obj in fA) {
    NSLog(@"%@", [obj description]);
}
```

```
> ./test
2011-03-13 10:58:53.885 test[218:903] {
    value = 15;
}
```

Here is a second example, taken from the docs, of how to construct an expression that uses a built-in function (they call this a function expression):

```
NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:6], nil];
NSExpression *eA = [NSExpression expressionForConstantValue:A];
NSArray *args = [NSArray arrayWithObject:eA];
NSExpression *e = [NSExpression
    expressionForFunction:@"average:"
    arguments:args];
id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);
float f = [result floatValue];
printf("result = %3.2f\n", f);
```

```
2011-03-13 11:01:00.337 test[236:903] 4.5 NSCFNumber
result = 4.50
```

There are lots of built-in functions available (here).

The third example is based on the first part (the simple part) of this post from Dave DeLong. It defines a category on NSNumber

```
@interface NSNumber (FactorialExpression)
- (NSNumber *) factorial;
@end

@implementation NSNumber (FactorialExpression)
- (NSNumber *) factorial {
    double baseValue = [self doubleValue];
    double result = tgamma(baseValue+1);
    return [NSNumber numberWithDouble:result];
}
@end
```

and the code block is:

```
NSNumber *n = [NSNumber numberWithDouble:4.2];
NSLog(@"%@ %@", n, [n factorial]);
NSLog(@"%p %d", n, [n respondsToSelector:@selector(factorial)]);
NSExpression *f = [NSExpression expressionForConstantValue:n];
NSExpression *e = [NSExpression
    expressionForFunction:f
    selectorName:@"factorial"
    arguments:nil];
NSLog(@"operand %@ %@", [e operand], [[e operand] class]);
NSLog(@"operand %@", [e function]);
id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);
```

```
> ./test
2011-03-13 11:02:46.798 test[251:903] 4.2 32.57809605033135
2011-03-13 11:02:46.800 test[251:903] 0x100108d20 1
2011-03-13 11:02:46.801 test[251:903] operand 4.2 NSConstantValueExpression
2011-03-13 11:02:46.801 test[251:903] operand factorial
2011-03-13 11:02:46.802 test[251:903] 32.57809605033135 NSCFNumber
```

And now, the way to solve my initial question seems clear:

Define a category on NSString that does what I want.
Wrap the call up in an NSExpression and then an NSPredicate. Next time, if I succeed.

## Saturday, March 12, 2011

### NSPredicate: compound Cocoa example in code

I'm resisting the temptation to call this simple---it's not! All we're doing is filtering an array of dicts for those whose 'value' is neither too large nor too small. I have no idea why it has to be this complicated. The example is from the NSPredicate docs (as reformatted and with renamed vars by me):

```
#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSArray *A = [NSArray arrayWithObjects:
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:5] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:50] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:500] forKey:@"value"],
        nil];
    NSExpression *lhs = [NSExpression expressionForKeyPath:@"value"];
    NSExpression *gtrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:10]];
    NSExpression *ltrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:100]];
    NSPredicate *gtpred;
    gtpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:gtrhs
        modifier:NSDirectPredicateModifier
        type:NSGreaterThanOrEqualToPredicateOperatorType
        options:0];
    NSPredicate *ltpred;
    ltpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:ltrhs
        modifier:NSDirectPredicateModifier
        type:NSLessThanOrEqualToPredicateOperatorType
        options:0];
    NSPredicate *pred;
    pred = [NSCompoundPredicate andPredicateWithSubpredicates:
        [NSArray arrayWithObjects:gtpred, ltpred, nil]];
    NSArray *fA = [A filteredArrayUsingPredicate:pred];
    for (id obj in fA) {
        NSLog(@"%@", [obj description]);
    }
    return 0;
}
```

```
> gcc -o test pred.m -framework Foundation
> ./test
2011-03-12 11:48:31.640 test[70403:903] {
    value = 50;
}
```

### NSPredicate: simple Cocoa examples

I'm exploring predicates and expressions in Cocoa using Objective-C. Actually, I'd like to use PyObjC, but this is tricky enough that I think it's better to start with Objective-C.

Ultimately I'd like to write complex predicates, for example, to filter using either a custom function or a block. The reason I need this is to validate edits to an NSTableView using bindings. According to the docs:

A predicate is a logical operator that returns a Boolean value (true or false). There are two types of predicate: comparison predicates and compound predicates:

● A comparison predicate compares two expressions using an operator. The expressions are referred to as the left hand side and the right hand side of the predicate (with the operator in the middle). A comparison predicate returns the result of invoking the operator with the results of evaluating the expressions.

● A compound predicate compares the results of evaluating two or more other predicates, or negates another predicate.
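Before diving into Objective-C, here's a rough Python analogy of that definition (my own sketch, not anything from Cocoa): expressions are functions of an object, a comparison predicate applies an operator to the results of two expressions, and a compound predicate combines predicates.

```python
import operator

def key_path(key):                  # expression: look up a key
    return lambda obj: obj[key]

def constant(c):                    # expression: constant value
    return lambda obj: c

def comparison(lhs, op, rhs):       # comparison predicate
    return lambda obj: op(lhs(obj), rhs(obj))

def and_predicate(*preds):          # compound predicate (AND)
    return lambda obj: all(p(obj) for p in preds)

# analog of the 10 <= value <= 100 example later in the post
pred = and_predicate(
    comparison(key_path('value'), operator.ge, constant(10)),
    comparison(key_path('value'), operator.le, constant(100)))

A = [{'value': 5}, {'value': 50}, {'value': 500}]
print([d for d in A if pred(d)])    # [{'value': 50}]
```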

All the code is like this:

 ```#import int main (int argc, const char * argv[]) { NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init]; .. specific code here .. [pool drain]; return 0; }```

compiled and run like this:

```
> gcc -o test pred.m -framework Foundation
> ./test
```

#### Example 1. Predicate with a format string using "like"

(the `[c]` makes the comparison case-insensitive, and the `*` is a wildcard):

```
NSArray *A = [NSArray arrayWithObjects:
    @"John",@"Paul",@"George",@"Ringo",nil];
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF like [c] 'g*' "];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:38:55.379 test[67230:903] George
```

#### Example 2. Format string with a mathematical operator

```
NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:5],nil];
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF > 4 "];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:38:22.313 test[67217:903] 5
```

Note: this also works with the usual kind of format-string substitution: `@"SELF > %d ", 4`

#### Example 3: Format string using a key path

```
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"],
    nil];
NSPredicate *p = [NSPredicate predicateWithFormat:@"name like %@", @"x"];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 10:01:43.963 test[67885:903] {
    name = x;
}
```

#### Example 4: Format string substituting a dynamic key path

```
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"],
    nil];
//NSLog(@"%@", [[A objectAtIndex:0] description]);
NSString *name = @"name";
NSString *value = @"x";
NSPredicate *p = [NSPredicate predicateWithFormat:@"%K like %@", name, value];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:29:59.220 test[67013:903] {
    name = x;
}
```

#### Example 5. Compound predicate

```
NSArray *theKeys = [NSArray arrayWithObjects:@"name",@"value",nil];
NSArray *o1 = [NSArray arrayWithObjects:@"x",[NSNumber numberWithInt:5],nil];
NSArray *o2 = [NSArray arrayWithObjects:@"y",[NSNumber numberWithInt:7],nil];
NSArray *o3 = [NSArray arrayWithObjects:@"z",[NSNumber numberWithInt:9],nil];
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObjects:o1 forKeys:theKeys],
    [NSMutableDictionary dictionaryWithObjects:o2 forKeys:theKeys],
    nil];
NSLog(@"%d", [A count]);
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"(value < %@) OR (name == %@)",
    [NSNumber numberWithInt:6], @"y"];
NSArray *fA = [A filteredArrayUsingPredicate:p];
NSLog(@"%d", [fA count]);
for (id obj in fA) { NSLog(@"%@", [obj description]); }

> ./test
2011-03-12 10:34:11.921 test[68778:903] 2
2011-03-12 10:34:11.924 test[68778:903] 2
2011-03-12 10:34:11.925 test[68778:903] {
    name = x;
    value = 5;
}
2011-03-12 10:34:11.925 test[68778:903] {
    name = y;
    value = 7;
}
```
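For comparison, the logic of this compound predicate can be mimicked in plain Python (just an illustration of what the predicate does, not PyObjC):

```python
# Two "records", like the NSMutableDictionary objects above.
A = [{'name': 'x', 'value': 5},
     {'name': 'y', 'value': 7}]

# The compound predicate: (value < 6) OR (name == 'y')
pred = lambda d: d['value'] < 6 or d['name'] == 'y'

# The analogue of filteredArrayUsingPredicate:
fA = [d for d in A if pred(d)]
```

Both records pass, just as both dictionaries do in the Objective-C version.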

### Dental project (6)

I want to show some more results from this project, namely, the UniFrac analysis. What I did for the paper was to cluster very closely related sequences (alignment > 450 and 0 or 1 mismatches), then upload them to RDP, which aligns the sequences as they are uploaded. The phylogenetic tree needs to be rooted, and I decided to use Thermotoga SL7 for this (Genbank AJ401017).

Rather than deal with the clustered OTUs for this post, I just uploaded all 1120 sequences, and carried out the analysis. The first time through (today) I forgot to include the outgroup! So that gives us a chance to see how much difference it makes.

• Working in directory: `temp`
• Check that seqs.fna from dental project dir has 1120 seqs
• Rename to `dental_1120.fna`
• Download as `dental_1120_rdp.fna`

• Use R/ape to make a tree

```
setwd('Desktop/temp/rdp')
library(ape)
dna = read.dna('dental_1120_rdp.fna',format='fasta')
tr = nj(dist.dna(dna))
plot(tr)
write.tree(tr,'tree.txt')
```

• Write a simple script to make the environment file

 `python write_env.py > environ.txt`

It looks like this:

```
DAA_44 D_DAA
DAA_43 D_DAA
DQ_209 D_DQ
..
```
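The post doesn't show `write_env.py` itself; a minimal sketch consistent with the output above (ids like `DAA_44` mapping to an environment label `D_DAA`) might be:

```python
# Hypothetical core of write_env.py: the environment (sample) name is
# the alpha part of the sequence id, prefixed with 'D_', in the
# two-column format UniFrac expects.
def make_env(titles):
    lines = []
    for t in titles:
        sample = t.split('_')[0]       # 'DAA_44' -> 'DAA'
        lines.append(t + ' ' + 'D_' + sample)
    return '\n'.join(lines)
```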

• Upload tree and environment file to UniFrac
• PCA (unweighted)
• View as data table

• Download data to `pca_web.txt`
• Run the script below to plot the data using matplotlib.


From the UniFrac FAQ:

My tree was not rooted, but I was able to upload my file and perform an analysis. Are the results valid?

There is no way to tell based on a Newick string alone whether a tree is rooted or not. If an unrooted tree is input, UniFrac will usually assign an arbitrary root and allow you to perform the analysis on that tree. How the tree is rooted can affect the results of both UniFrac tests and the P test. You should redo the analysis with a tree that is rooted with an appropriate outgroup.

It turns out to be easy enough: go back to RDP, browse to find SL7, and add it to the sequence cart. Repeat the download to `dental_1120+_rdp.fna`. Load the last 5 sequences into clustalx.app and check that SL7 is really properly aligned.

```
setwd('Desktop/temp/rdp')
library(ape)
dna = read.dna('dental_1120+_rdp.fna',format='fasta')
tr = nj(dist.dna(dna))
> tr

Phylogenetic tree with 1121 tips and 1119 internal nodes.

Tip labels:
  DAA_44, DAA_43, DQ_209, DAA_45, DAA_40, DC_81, ...

Unrooted; includes branch lengths.
> grep('SL', tr$tip.label)
[1] 1121
```

• Root the tree appropriately and write it to disk

```
tr2 = root(tr,1121)
plot(tr2)
write.tree(tr2,'rooted_tree.txt')
```

• Go back to UniFrac

Repeat the PCA. You can look at the data in a spreadsheet app:

Now I plot it in matplotlib. The first image is what I plotted today for the rooted tree. The second is from the paper. Looks pretty good to me. Also, note some minor differences from the previous graphic, where the tree we used was unrooted (UniFrac rooted it for us, however it manages that when a tree isn't properly rooted).

`plot_web.py`

```
import sys
import matplotlib.pyplot as plt
from fileUtilities import load_data

d = 0.5
fn = 'pca_web.txt'
data = load_data(fn)
data = data.strip().split('\n\n')[0]
data = data.strip().split('\n')[1:]
L = list()
for e in data:
    name, x, y = e.split()[:3]
    x,y = float(x), float(y)
    x *= -1
    name = name[2:]
    if name[1] in 'BCM':
        c = 'blue'
    else:
        c = 'red'
    plt.scatter(x,y,s=100,color=c)
    if name == 'DG':
        y += 0.03
    plt.text(x+0.03,y-0.02,va='center',
        s=name[1:],color=c,fontsize=16)

plt.plot((-d,d),(0,0),':',zorder=0)
plt.plot((0,0),(-d,d),':',zorder=0)
ax = plt.axes()
ax.set_xlim(-d,d)
ax.set_ylim(-d,d)
plt.savefig('pca_web.png')
```

### Dental project (5)

This post is one of a series (see dental project here or in the sidebar).

Last time I said I would show you how I make heatmaps these days. I've approached it several different ways over the past few years (R, Cocoa, matplotlib), but I think now that matplotlib is best, at least for me. Ultimately what I want is flexibility, and if you're a Python coder and you have matplotlib installed (as we've also discussed many times), then you'll have that. But I don't want to get into the technical details---and actually the script is a bit long, so I just put it (`Heatmapper.py` and its helper `Preprocessor.py`) into the zipped project files on Dropbox (here). The output from two different modes is at the bottom of the post. You just need a file `data.csv` in the same directory. It looks a little fuzzy and not as clean as I would like, but that's partly because there are so many samples, and partly because of the italic font. If you do `savefig` to a pdf file, and then blow it up, it looks great.
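`Heatmapper.py` and `Preprocessor.py` are in the Dropbox archive, so here is only a minimal sketch of the core idea: read `data.csv`, treat blank cells as zeros, and draw the matrix with matplotlib. The function names here are mine, not the actual script's.

```python
import csv
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

def load_matrix(fn):
    # data.csv has a blank leading header cell, then sample names;
    # each following row is a genus name plus counts (blank = zero)
    with open(fn) as f:
        rows = list(csv.reader(f))
    samples = rows[0][1:]
    genera = [r[0] for r in rows[1:]]
    M = np.array([[float(x) if x else 0.0 for x in r[1:]]
                  for r in rows[1:]])
    return genera, samples, M

def heatmap(genera, samples, M, fn='heatmap.png'):
    fig, ax = plt.subplots()
    ax.imshow(M, cmap='hot', aspect='auto', interpolation='nearest')
    ax.set_xticks(range(len(samples)))
    ax.set_xticklabels(samples, rotation=90)
    ax.set_yticks(range(len(genera)))
    ax.set_yticklabels(genera, fontsize=6)
    fig.savefig(fn)
```

Saving to pdf instead of png (just change the extension) gives the scalable output mentioned above.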

In this post I want to talk in a general way about the project and what I think it means. It began about four years ago, when we became aware that some folks in Dentistry at our school (WVU) were involved in a huge study of people from Appalachia (it's called COHRA). Poor oral health is a particular problem in West Virginia, and this study had collected thousands of samples along with patient histories and lots of clinical data. My belief is that the important thing about these samples is that the patients are young yet have serious periodontal issues. In any event, we convinced the people who actually run the project (based elsewhere) to let us have (a small part of) 8 samples out of all their thousands sitting in the freezers down the hall.

We did PCR with "universal" primers for the bacterial 16S rRNA gene, and cloned and sequenced the numbers you see in the table. It's not a big study (we don't have much money anyway), but we saw something which I think is truly significant. In high disease individuals, a broad group of microbes from the Clostridiales including an unusual clade called the Veillonellaceae are increased in abundance, whereas the sequences from control individuals in this clade were all very closely related to Veillonella parvula.

One reason this observation may be important is that the so-called "red complex", which is thought to be associated with serious periodontal disease, can only be recovered in about half the individuals with this diagnosis (not even considering abundance).

That story is in the modified version of the map above, where I drew a red box around the region of interest for the three controls, or "low disease" samples. Time went on, another set of samples from a different group was added to the study, and we were able to get the work published. So that's why the study looks so old-fashioned: in an era of millions of reads, we've got about a thousand.

My role in all this was to actually do the analysis. I remember "we" wrote a grant (actually, someone else did!) and listed me as a technical expert in bioinformatics. Of course, the reviews were scathing. Dr. E doesn't have a degree in bioinformatics. How could he know anything?

Well, I've learned a few things over the years. Rule one is, never make your own database: let someone else do it. That's why HOMD (and Greengenes and RDP) are so great. I particularly like the tools at the RDP site. It is very nice software.

And rule two is, if you live long enough, you will see work that took you months or years to accomplish be achieved using new tools in mere seconds or hours. Sequencing is a great example of this. When I was young I spent most of three months getting 500 bp; when I was a bit older I invested six months for 3.5 kb; still later it was a year for 20 kb.

This project is another example. I spent a year and more writing some 50 or so Python scripts (and rewriting them), and now QIIME does the whole thing in mere seconds.

Well, not quite the whole thing. I have a bit more to do with this project. I want to show you the UniFrac analysis of beta diversity, and show how to make what I think is a nicer plot of the PCoA results. Also, I want to show some phylogenetic trees detailing the increased diversity (species richness, really) in the Veillonellaceae that I mentioned.

And I should say: it's been fun. Even if I don't have that degree, or any papers with Rob Knight, I think I've learned something about Bioinformatics in the last 5 years.

## Thursday, March 10, 2011

### Dental project (4)

This post is one of a series (see dental project here or in the sidebar).

After getting a set of sequences and removing chimeras, the next step is almost anticlimactic. We just copy a modified version of the shell script (from here, without the `cd` calls) or paste in the commands working from the `dental` directory (either individually, or all at once):

```
#!/bin/bash
pick_otus.py -i seqs.fna -m uclust -s 0.97 -o otus
pick_rep_set.py -f seqs.fna -i otus/seqs_otus.txt -m most_abundant -o otus/reps.txt
align_seqs.py -i otus/reps.txt -m pynast -t ~/data/core.txt -o aln
assign_taxonomy.py -i otus/reps.txt -m rdp -o tax
filter_alignment.py -i aln/reps_aligned.txt -m ~/data/mask.txt -o aln2
make_phylogeny.py -i aln2/reps_aligned_pfiltered.fasta -o figs/tree.tre
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt
summarize_taxa.py -i figs/otu_table.txt -o figs/otu_table_Level3.txt -L 3
plot_taxa_summary.py -i figs/otu_table_Level3.txt -l Phylum -o figs -k white
make_otu_heatmap_html.py -i figs/otu_table.txt -o figs
```

It's all over in a few seconds.

The heatmap QIIME produced is at the top of the post. It is truly a remarkable html page, with a graphic where you can reorder the columns or rows by drag and drop, redo the map at different thresholds for the OTUs, etc. I've never seen anything quite like it. But (and this is just me), it's not pretty enough.

So what I'd like to do from here is to show you how I currently make heatmaps with matplotlib, and we'll get into that next time.

First, I have to extract the data from QIIME. The script is complicated a bit by an additional job: I'm going to organize the rows and columns. (QIIME can do this too---see the tutorial).

The columns will be in the order they appear in `sample_names.txt` and the rows as they appear in `genera_and_colors.txt`. These files are in the same directory. The second one starts like this:

```
# Bacteria black
Bacteria
# Bacteroidetes green
Bacteroidetes
Bacteroidales
Rikenella
Prevotella
Porphyromonas
Saprospiraceae
Sphingobacteriales
```

I just do this:

 `> python grab_qiime_data.py > data.csv`

Here's the first part of `data.csv`:

```
,DB,DC,DM,DF,DL,DG,DI,DA,DT,DQ,DV,DAA,DZ
Bacteria,,,,2,,,,,,,,,12
Bacteroidetes,,,,,,2,,,,,,,
Bacteroidales,,,,,,1,,,,,,,
Prevotella,,,,,3,2,,1,,,,,
Porphyromonas,,,,,,1,,1,,,,,
Sphingobacteriales,,,,,,,,,,1,,,
Capnocytophaga,9,,6,2,3,6,1,19,31,35,2,8,7
Fusobacterium,1,,,4,,2,1,2,,2,1,3,1
```

The leading comma on line 1 is so the column headers line up properly in a spreadsheet. Speaking of spreadsheets, here is a screenshot after dropping the `data.csv` file onto Numbers (you could use Excel, of course):
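`grab_qiime_data.py` itself is in the Dropbox archive; the reordering step at its heart might look like this sketch (the function name and the counts data structure are my assumptions, not the actual script):

```python
# Given counts keyed by (genus, sample), emit csv rows in the order
# dictated by the two control files, leaving zero cells blank (as in
# data.csv above).
def write_csv(counts, genera, samples):
    lines = [',' + ','.join(samples)]   # leading comma for the header
    for g in genera:
        row = [str(counts.get((g, s), '')) for s in samples]
        lines.append(g + ',' + ','.join(row))
    return '\n'.join(lines)
```

The point is that row order comes from `genera_and_colors.txt` and column order from `sample_names.txt`, not from QIIME's output.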

That was painless! Zipped project files in Dropbox (here).

## Wednesday, March 9, 2011

### Dental project (3)

wikimedia

Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.

And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.

Make a directory `temp` with a copy of `seqs.fna`. The sequences first need to be converted to NAST format, then we can run `ChimeraSlayer.pl`.

```
prog1=~/Software/microbiomeutil_2010-11-02/NAST-iEr/run_NAST-iEr.pl
$prog1 --query_FASTA seqs.fna > seqs.nast
prog2=~/Software/microbiomeutil_2010-11-02/ChimeraSlayer/ChimeraSlayer.pl
$prog2 --query_NAST seqs.nast
```

It takes the better part of an hour on my slowest machine (a 5-year-old iMac).

`seqs.nast.CPS.CPC.wTaxons` has flagged four sequences:

```
DA228   INTRA-GENUS
DQ822   INTRA-PHYLUM
DV55    INTRA-FAMILY
DAA89   INTRA-FAMILY
```

I grab those four by hand into a new file `suspects.fna` (there is probably a better way) and do:

```
$prog1 --query_FASTA suspects.fna > suspects.nast
$prog2 --query_NAST suspects.nast --printCSalignments
```

The output shows there is definitely a problem. In `suspects.nast.CPS.CPC.wTaxons` we have:

```
ChimeraSlayer  DQ_822  S000427388  S000260335  1.0566  98.74  100  0.7978  74.56  0  YES  NAST:1861-1863  ECO:324-325
Streptococcus  Streptococcus cristatus (T); NCTC12479; AB008313  Streptococcus cristatus
Lachnospiraceae Incertae Sedis  Clostridium aerotolerans (T); DSM 5434; X76163  Clostridium aerotolerans
INTRA-PHYLUM

Per_id parents: 73.80   Per_id(Q,A): 93.45
---------------------------------------------------  A: S000427388
        99.64                             78.63
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~  Q: DQ_822
DivR: 1.057 BS: 100.00   |   Per_id(QLA,QRB): 98.74
                         |
          (L-AB: 72.50)  |  (R-AB: 76.92)
            WinL:0-279   |  WinR:280-396
                         |   Per_id(QLB,QRA): 74.56
DivR: 0.798 BS: 0.00     |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~  Q: DQ_822
        72.86                             96.58
----------------------------------------------------  B: S000260335
Per_id(Q,B): 79.85
DeltaL: 26.79  DeltaR: -17.95

CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATGGTCAATGGGAAATCC  A: S000427388
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATAGTCAATGGGAAATCC  Q: DQ_822
TCCGACTAAGATTCGGAATCGGGCACAAGATCTCGACAGGCAGCACAGTGGAACTCCGGT  B: S000260335
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!  !!!!!!!!!!!!!!!

ACCCTTGTCGTGACATC  A: S000427388
ACCCTTGTCGTGACATC  Q: DQ_822
GGTGCCTGGACCCAGCT  B: S000260335
!!!!!!!!!!!!!!!!!

** Breakpoint **

!!!!!!!!!!!!! !!!!! !!!!!!!
GACCGACGGTCGAGTGTATCGGGGTAAA  A: S000427388
AGAGAGGACCTCGGTTATATGACAGCGG  Q: DQ_822
AGAGAGGACCTCGAGTATATCACAGCAG  B: S000260335
 !! ! !  !
```

The first match is great for a while, then terrible, and the second is the converse.

I need to look into whether I should update the Genbank records, but I suspect the answer is yes.

Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:

```
>DQ822
BLAST front (len = 207):
320  207/207  100.00  Streptococcus_clone_BP2-57_AB121930.1
319  205/207   99.03  Streptococcus_clone_502H08_AM420202.1
323  203/207   98.07  Streptococcus_cristatus_AB008313.1
321  200/207   96.62  Streptococcus_clone_BW009_AY005042.1
334  199/209   95.22  Streptococcus_sanguinis_SK36_SK36
BLAST back (len = 187):
349  179/187   95.72  Uncultured_clone_4.59_DQ346409.1
79   179/187   95.72  Clostridiales_clone_301C11_AM420062.1
121  178/187   95.19  Eubacterium_clone_DO008_AF385508.1
117  178/187   95.19  Eubacterium_clone_BP2-88_AB121960.1
115  178/187   95.19  Eubacterium_clone_BL026B96_AY806377.1
>DV55
BLAST front (len = 210):
353  207/210   98.57  Uncultured_clone_E105_DQ326659.1
100  206/210   98.10  Dialister_sp._E2_20_AF481209.1
94   205/210   97.62  Dialister_invisus_AY162469.1
31   190/207   91.79  Allisonella_clone_BL34_DQ130020.1
93   192/210   91.43  Dialister_clone_MCE7_134_AF481210.1
BLAST back (len = 190):
373  190/190  100.00  Veillonella_parvula_X84005.1
372  190/190  100.00  Veillonella_clone_X042_AF287781.1
370  190/190  100.00  Veillonella_clone_BU083_AF366266.1
369  190/190  100.00  Veillonella_clone_AA050_AF287782.1
371  182/183   99.45  Veillonella_clone_R1_DQ123569.1
>DAA89
BLAST front (len = 210):
283  209/210   99.52  Selenomonas_clone_CI002_AF287798.1
286  202/210   96.19  Selenomonas_clone_EQ054_AF385495.1
288  201/210   95.71  Selenomonas_clone_FT050_AY349403.1
298  189/195   96.92  Selenomonas_noxia_AF287799.1
297  200/210   95.24  Selenomonas_infelix_AF287802.1
BLAST back (len = 190):
373  184/187   98.40  Veillonella_parvula_X84005.1
370  184/187   98.40  Veillonella_clone_BU083_AF366266.1
372  181/184   98.37  Veillonella_clone_X042_AF287781.1
369  181/184   98.37  Veillonella_clone_AA050_AF287782.1
371  182/187   97.33  Veillonella_clone_R1_DQ123569.1
```

Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.

It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.

Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.

We'll remove these four from our sequence file by hand, leaving 1120:

```
DA_228
DQ_822
DV_55
DAA_89
```
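Taking my own moral to heart, the removal could be automated too. Here's a sketch (not part of the actual toolchain; the function is hypothetical) that drops flagged ids from a fasta string:

```python
# Drop records whose title is in bad_ids; everything else passes
# through. Titles are taken as the first word after '>'.
def filter_fasta(fasta_text, bad_ids):
    keep = []
    for rec in fasta_text.strip().split('>')[1:]:
        title = rec.split('\n', 1)[0].split()[0]
        if title not in bad_ids:
            keep.append('>' + rec.strip())
    return '\n'.join(keep)
```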

### David Broder

I'm a political junkie. Can't help it. I start the day with Ezra Klein (after he appears online), and perhaps, end the day with Josh Marshall's gang. So it is with sadness that I hear of David Broder's passing (no link, the Post obit runs an ad).

There was a time in the 90s when I watched Washington Week in Review (when it was something, with Ken Bode as the moderator) and Broder was always impressive. I think he was the last of the giants---think David Brinkley, Walter Cronkite, Charles Kuralt.

Now we're left with screeching vapid idiots like---well, anyone on CNN except David Gergen. I miss the time when serious analysis could be found on TV. Sigh. Perhaps a good analogy is this: think of the difference between Hubie Brown (or Doug Collins) and anyone else who calls basketball games. I love Hubie, he's entertaining, but even more important, you can actually learn something.

### Dental project (2)

This is a series, first post here. Before we do anything else, we need to clean up the titles on the sequences. They come from Genbank like this:

 `>gi|324104022|gb|HQ894465.1| Uncultured bacterium clone DA19 16S ribosomal RNA gene, partial sequence`

We want this:

 `>DA_19`

We could do something like a regular expression, but that's overkill. Note that the alpha part is variable length, so we have to be a little bit smart. But these are so regular, it's easy. Also, UniFrac wants an underscore, so we add that.

Just do this from the command line:

 `> python retitle.py > seqs.txt`

If you count the sequences, you should have 1124.

`retitle.py`

```
import utils as ut

digits = '0123456789'
data = ut.load_data('results.txt')
data = data.strip().split('\n\n')
for item in data:
    title,seq = ut.clean_fasta(item)
    e = title.split()[4]
    d = ''.join([c for c in e if c in digits])
    s = ''.join([c for c in e if not c in digits])
    print '>' + s + '_' + d
    print seq
    print
```

### Dental project (1)

I'd like to spend a few posts talking about a project we just completed on surveying the bacterial species present in patients with gingivitis compared to normal controls. The main reason is to exercise QIIME with a different data set, but I'd also like to say a bit about the project.

I checked this morning at PubMed and found the paper has come out:

Olson 2011 PMID 21362199

Click on the link to download it. At 7.6 MB, that's just about the largest file I've ever seen for a paper. Opening it up, I see why: it's still in manuscript form, and some of the figures are quite big. To play with this, we'll need the sequences. Luckily they were just posted by Genbank the other day.

I wrote a script to grab the sequences in chunks of 40, with a timer to sleep for 10 seconds between requests. The first sign of trouble was here:

```
HQ895465 HQ895504 1
URL Error
HQ895505 HQ895544 1
..
```

but eventually, we did another request for this batch which looked like it worked:

 `HQ895465 HQ895504 2`

but actually, the file contains this near the end:

 `PubseqAccess cmd(id_gi_by_word HQ895505) failed with Cannot get server name from load balancer 'PUBSEQ_OS_PUBLIC' : errmsg='Service not found' HQ895505`

and then more of the same. Looking at the sequences, it seems they cut us off at 1000 sequences, stopping with HQ895464.1.

I thought this would be OK: it's very early in the morning, and we left more than 3 seconds between requests, but apparently we ran up against some other kind of limit.

I give the server some time to calm down (and change the name of the file we've written to), edit the list, and try again:

```
> python fetchSeqs.py
HQ895465 HQ895504 1
HQ895505 HQ895544 1
HQ895545 HQ895584 1
HQ895585 HQ895588 1
```

then combine the results by hand. Next time we'll take a look at them.

`fetchSeqs.py`

```
import urllib2, sys, time
from utils import load_data

ncbi = 'http://eutils.ncbi.nlm.nih.gov'
eutils = ncbi + '/entrez/eutils/'
efetch = 'efetch.fcgi?'

def chunks(L,SZ):
    rL = list()
    while L:
        rL.append(L[:SZ])
        L = L[SZ:]
    return rL

def fetch(L):
    s = eutils + efetch
    s += 'id=' + L
    s += '&db=nucleotide&rettype=fasta&retmode=text'
    try:
        FH = urllib2.urlopen(s)
        data = FH.read()
    except:
        raise ValueError('URL Error')
    if 'NCBI C++' in data:
        raise ValueError('Empty')
    elif not data:
        raise ValueError('Empty')
    return data

def run(FH):
    first = 894465
    #first = 895465
    last = 895588
    L = ['HQ' + str(n) for n in range(first,last+1)]
    rL = list()
    L = chunks(L,SZ=40)
    L = [(e,1) for e in L]
    while L:
        sL,n = L.pop(0)
        print sL[0], sL[-1], n
        if n > 3:
            continue
        try:
            s = fetch(','.join(sL))
            FH.write(s)
        except ValueError as e:
            print e
            L.append((sL,n+1))
        time.sleep(10)

if __name__ == '__main__':
    FH = open('results.txt','w')
    L = run(FH)
    FH.close()
```

## Tuesday, March 8, 2011

### 16S rRNA V regions, continued

As promised last time (here), this is the code to plot V regions from the sequences in the Enterobacteriales, obtained from RDP. If you compare to this post from the other day, you'll see we match pretty well.

The figure looks better with a wider window, but the limits are established more accurately with a smaller window. The threshold T needs to be adjusted based on the window size.

One detail: I did the redirect below to save the extreme values and then parsed them with the code at the end of the post.

 `python script.py > extreme_values.txt`

`script.py`

```
import utils as ut
from info import shannon
import matplotlib.pyplot as plt

data = ut.load_data('entero.txt')
data = data.strip().split('>')[1:]
EC = [e for e in data if 'X80725' in e][0]
EC = ut.clean_fasta(EC)[1]
data = [ut.clean_fasta(e)[1] for e in data]
L = ut.make_count_list(data)

pos = 0
R = range(1,1451)
iL = list()
for i,c in enumerate(EC):
    if c == '-':
        continue
    pos += 1
    if not pos in R:
        continue
    cD = L[i]
    e = shannon(cD,'ACGT')
    iL.append(e)
    #print str(pos).ljust(3), c + ' ',
    #print ''.join([str(cD[k]).ljust(5) for k in 'ACGT']),
    #print round(e,2)

aL = list()
w = 20
T = 1.8
for i in range(len(iL)):
    j, k = i - w, i + w + 1
    if j < 0:
        j = 0
    if k > len(iL):
        k = len(iL)
    m = ut.mean(iL[j:k])
    if m < T:
        print i+1
    aL.append(m)

plt.plot((1,1451),(T,T),lw=2,color='r',zorder=0)
plt.scatter(R,aL)
ax = plt.axes()
ax.set_xlim(-5,1455)
ax.set_ylim(0.8,2.05)
plt.savefig('example.pdf')
```
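The `shannon` function imported from my `info` module isn't shown in the post; a minimal version consistent with how the script calls it (a dict of base counts and the alphabet `'ACGT'`) would be:

```python
from math import log

def shannon(cD, symbols):
    # Shannon entropy in bits over the given symbols;
    # cD maps each base to its count at this alignment column.
    counts = [cD.get(c, 0) for c in symbols]
    total = float(sum(counts))
    H = 0.0
    for n in counts:
        if n:
            p = n / total
            H -= p * log(p, 2)
    return H
```

A perfectly conserved column gives 0 bits; an even split over four bases gives the maximum of 2 bits, which is why the plot's y-axis tops out just above 2.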

`analyze.py`

 `import utils as utdata = ut.load_data('extreme_values.txt')L = [int(n) for n in data.strip().split('\n')]current = L.pop(0)print current, '-',while L: next = L.pop(0) if next != current + 1: print current print next, '-', current = nextprint next`