Sunday, March 27, 2011

Equation of the ellipse



Here is a simple derivation of the equation for an ellipse. It is taken from Morris Kline's book, Calculus: An Intuitive and Physical Approach.

We place the two foci of an ellipse (F and F') at the coordinates (c,0) and (-c,0). Each point on the ellipse is defined by the property that the sum of the distances to F and to F' is constant, which we define as equal to 2a. Our old friend Pythagoras helps us find the distances in terms of x,y and c.


PF  = √[y^2 + (x-c)^2]
PF' = √[y^2 + (x+c)^2]

2a = PF + PF'
PF = 2a - PF'
√[y^2 + (x-c)^2] = 2a - √[y^2 + (x+c)^2]


Square both sides (and expand):


y^2 + x^2 - 2xc + c^2 = 4a^2 - 4a √[y^2 + (x+c)^2] + y^2 + x^2 + 2xc + c^2


Cancel terms (x^2, y^2, and c^2) and rearrange to isolate the remaining square root:


-2xc = 4a^2 - 4a √[y^2 + (x+c)^2] + 2xc
4a √[y^2 + (x+c)^2] = 4a^2 + 4xc
a √[y^2 + (x+c)^2] = a^2 + xc


Square both sides again and expand:


a^2 (y^2 + x^2 + 2xc + c^2) = a^4 + 2a^2 xc + x^2 c^2
a^2 y^2 + a^2 x^2 + a^2 c^2 = a^4 + x^2 c^2
a^2 x^2 - x^2 c^2 + a^2 y^2 = a^4 - a^2 c^2


Factor out a^2 - c^2:


x^2 (a^2 - c^2) + a^2 y^2 = a^2 (a^2 - c^2)


Define b^2 = a^2 - c^2:


b^2 x^2 + a^2 y^2 = a^2 b^2


Divide both sides by a^2 b^2 to obtain the familiar form:


x^2/a^2 + y^2/b^2 = 1


Note that when


x = 0, y = ±b
y = 0, x = ±a


The squaring method allows the possibility that the simplified equation has solutions that are not valid for the original version. This turns out not to be the case, and Kline deals with this issue in the book.

Also, we might note that since

b^2 = a^2 - c^2

if a and b are fixed, then c is determined. For the figure shown here, b = c and hence a = √2 b.
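As a quick numerical sanity check of the derivation (a sketch; b = c = 1 and a = √2 match the figure described above):

```python
import math

# Ellipse parameters matching the figure: b = c = 1, so a = sqrt(2)
# (since b^2 = a^2 - c^2).
b = 1.0
c = 1.0
a = math.sqrt(b**2 + c**2)

# Pick an x in [-a, a] and solve for y from x^2/a^2 + y^2/b^2 = 1.
x = 0.5
y = b * math.sqrt(1 - x**2 / a**2)

# The defining property: distances to the foci (c,0) and (-c,0) sum to 2a.
PF  = math.sqrt(y**2 + (x - c)**2)
PFp = math.sqrt(y**2 + (x + c)**2)
print(PF + PFp, 2 * a)   # the two values agree
```

Any x in the interval works equally well; the check is independent of the point chosen.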

UPDATE: A virtually identical proof can be found in Wikipedia (here)---should have looked first.

Flags, detail

A quick note about the flags. There are enough now that it's hard to be sure I got them all. So I went to the Flag Counter site (this page, and the next). Rather than do anything fancy, I just copied the text to a file and then processed it with Python. My database of flag images is from Wikipedia, and I shortened the file names to the country codes. Since I can't remember a number of them, I wrote a Python script to harvest the Flag Counter entries and match them with the country codes (from here).

I checked the directory with the flag images by eye, which is almost certainly a mistake.
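One typical issue: country names may contain spaces, while the trailing fields are counts. Here's a sketch of the slicing trick the script at the end uses, with invented lines standing in for the real Flag Counter format (which I don't reproduce here):

```python
# Hypothetical examples of the kind of line that comes back from a scrape:
# a leading rank, a country name that may contain spaces, and four trailing
# count fields. The name is everything between the first token and the last
# four, joined with underscores to match the flag-image file names.
lines = ['1 United States 1234 10 20 30',
         '2 Trinidad and Tobago 99 1 2 3']

names = []
for line in lines:
    L = line.strip().split()
    i = len(L) - 4          # index of the first of the four count fields
    names.append('_'.join(L[1:i]))

print(names)   # ['United_States', 'Trinidad_and_Tobago']
```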

Here is the list of countries from which visitors to this site have come, in alphabetical order. The script is at the end. It shows a number of typical issues you run into with this kind of processing.


AE  United Arab Emirates
AL Albania
AN Netherlands Antilles
AR Argentina
AT Austria
AU Australia
BB Barbados
BE Belgium
BG Bulgaria
BH Bahrain
BR Brazil
BY Belarus
CA Canada
CH Switzerland
CL Chile
CN China
CO Colombia
CR Costa Rica
CS Serbia
CV Cape Verde
CY Cyprus
CZ Czech Republic
DE Germany
DK Denmark
DZ Algeria
EC Ecuador
EE Estonia
EG Egypt
ES Spain
FI Finland
FR France
GH Ghana
GR Greece
HK Hong Kong
HR Croatia
HU Hungary
ID Indonesia
IE Ireland
IL Israel
IN India
IS Iceland
IT Italy
JM Jamaica
JP Japan
KR South Korea
LT Lithuania
LU Luxembourg
MA Morocco
MD Moldova
MT Malta
MU Mauritius
MX Mexico
MY Malaysia
NL Netherlands
NO Norway
NZ New Zealand
PA Panama
PE Peru
PH Philippines
PK Pakistan
PL Poland
PR Puerto Rico
PT Portugal
QA Qatar
RO Romania
RU Russia
SA Saudi Arabia
SE Sweden
SG Singapore
SI Slovenia
SK Slovakia
SV El Salvador
TH Thailand
TN Tunisia
TR Turkey
TT Trinidad and Tobago
TW Taiwan
UA Ukraine
UK United Kingdom
US United States
UY Uruguay
VE Venezuela
VN Vietnam
ZA South Africa



from utils import load_data

specials = { 'South_Korea':'Korea_(South)',
             'Russia':'Russian_Federation',
             'New_Zealand':'New_Zealand_(Aotearoa)',
             'Serbia':'Serbia_and_Montenegro',
             'Croatia':'Croatia_(Hrvatska)',
             'Vietnam':'Viet_Nam' }

data = load_data('country-codes.txt')
D = dict()
for line in data.strip().split('\n'):
    L = line.strip().split()
    D['_'.join(L[1:])] = L[0]

cL = list()
data = load_data('scraped.txt')
for line in data.strip().split('\n'):
    L = line.strip().split()
    i = len(L) - 4
    country = '_'.join(L[1:i])
    cL.append(country)
    if country in specials:
        k = specials[country]
        D[country] = D[k]

def f(k): return D[k]
for country in sorted(cL, key=f):
    print D[country], '\t', country.replace('_',' ')

Flag update


We continue to accumulate unique visitors from new countries. Here are the flags of 16 more.

Thanks for reading!

Friday, March 18, 2011

nothing in biology makes sense except in the light of evolution

I've been reading Mike Yarus's book, Life from an RNA World (Amazon). It's a very readable account of evolution from the perspective of sequences and the RNA world; his intelligence and wit inform every page. In one chapter, he gets Rob Knight and Steve Freeland to do an evolution simulation:

we let a computer write out a random string, mutate 1 in 100 characters in each generation, and select changes only if they match Dobzhansky


In my version:
  • in each generation we pick one position
  • if it already matches Dobzhansky, continue to the next generation
  • mutate to a random choice from the set of symbols

    This isn't a very good model of evolution. It was just fun to spend a few minutes coding it.

    Here is the beginning, middle and end of one run:


    > python evolve.py 
    0 LVSTUknwIfGIRHFgHESZ M zthWtNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
    100 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
    200 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidhWEZO ZwNjDyCvq
    300 LVSTiknwIfGIRHFgHESZ M zthetNQ k qgtoeMvjeJAzOidhWEZO ZwNjuyivq

    6100 notTing if biHlogy makes sense except iv the Oight of Zvolution
    6200 notTing if biHlogy makes sense except iv the Oight of Zvolution
    6300 nothing if biHlogy makes sense except iv the Oight of Zvolution
    6400 nothing if biHlogy makes sense except iv the Oight of Zvolution

    13800 nothing in biHlogy makes sense except in the light of evolution
    13900 nothing in biHlogy makes sense except in the light of evolution
    14000 nothing in biHlogy makes sense except in the light of evolution
    14100 nothing in biHlogy makes sense except in the light of evolution

    14167 nothing in biology makes sense except in the light of evolution
    14167 nothing in biology makes sense except in the light of evolution



import random
s = 'nothing in biology makes sense'
s += ' except in the light of evolution'
N = len(s)

symbols = 'abcdefghijklmnopqrstuvwxyz '
symbols += symbols.upper()
L = [random.choice(symbols) for i in range(N)]
print ' 0 ' + ''.join(L)

i = 0
while L != list(s):
    i += 1
    v = i and not i % 100
    if v: print str(i).rjust(5),
    j = random.choice(range(N))
    if L[j] == s[j]:
        if v:
            print ''.join(L)
        continue
    c = random.choice(symbols)
    if c == s[j]: L[j] = c
    if v: print ''.join(L)

print
print str(i).rjust(5), s
print str(i).rjust(5), ''.join(L)
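How long should a run take? A rough coupon-collector estimate (my own back-of-the-envelope, not from the book): with k positions still wrong, one generation makes progress with probability about (k/N)·(1/S), where S = 54 is the size of the symbol set, so the expected run length is roughly N·S·(1 + 1/2 + ... + 1/N):

```python
# Back-of-the-envelope estimate of the expected number of generations.
# With k mismatched positions left, a generation fixes one with probability
# about (k/N) * (1/S): pick a wrong position, then draw the right symbol.
# Summing the expected waits over k = N..1 gives N * S * H_N.
N = 63          # length of the Dobzhansky sentence
S = 54          # upper and lower case letters, plus space (counted twice)
H = sum(1.0 / k for k in range(1, N + 1))
expected = N * S * H
print(int(expected))   # roughly 16,000
```

The run above took 14,167 generations, in reasonable agreement with the estimate (which ignores the few positions that match by chance at the start, and the doubled space character).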
Mutual Information (3)


    We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.

    Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.


     90 4.102
    260 4.025
    122 3.993
    209 3.976
    234 3.944
    101 3.939
    235 3.938
    152 3.935
    63 3.91
    222 3.887

    159 0.191
    250 0.138
    132 0.133
    291 0.086
    199 0.067
    242 0.052
    16 0.012
    106 0.012
    156 0.012
    158 0.012


    As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.



    We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.



    In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.

    I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.



    Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:


    >26250004-26250005/1-457
    KQLADDRTLLMAGVSHDLRTPLTRI
    RLATEMMSAESINKDIEECNAIIEQ
    FIDYLRTGMADLNAVLGEVIA--AE
    SGYEREIETAL-YVKMHPLSIKRAV
    ANMVVNAARY-GNGWIKVSSGTEAW
    FQVEDDGPGIAPEQRKHLFQPFVRG
    DISGTGLGLAIVQRIVDNHNGMLEL
    GTSERGGLSIRAWLPNYKILVVDDD
    MRLRALLERYLTEQGFQVRSVANAE
    QMDRLLTRESFHLMVLDLMLPGEDG
    LSICRRLRSQSPMPIIMVTAKGEEV
    DRIVGLEIGADDYIPKPFNPRELLA
    RIRAVLRR


    I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,

    EnvZ (the HK) sequence starts with:


    AGVKQLADDRTLLMAGVSHDL
    KQLADDRTLLMAGVSHDL


    The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.

    OmpR (the RR) sequence starts with


    MQENYKILVVDDDMRLRALLER
    NYKILVVDDDMRLRALLER


    The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.
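The two offsets can be captured in a pair of tiny helper functions (my names, just to make the bookkeeping explicit):

```python
# Alignment index -> protein residue number (1-based), per the text:
# EnvZ (the HK): alignment position 0 is residue 15, so add 15.
# OmpR (the RR): alignment position 190 is residue 3, so subtract 187.
def envz_residue(j):
    return j + 15

def ompr_residue(i):
    return i - 187

print(envz_residue(0), ompr_residue(190))   # 15 3
```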

    Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).

[Note: to keep the code simple, I ignored the situation where a column contains '-' (a gap in the alignment) for some of the sequences. That's what's giving us the two strange results below at positions 16 and 20.]


    > python plot.py 
    14 RLRAL 42 ATEMM 0.822
    18 LLERY 42 ATEMM 0.816
    22 YLTEQ 42 ATEMM 0.769
    15 LRALL 37 TRIRL 0.730
    15 LRALL 42 ATEMM 0.694
    56 MLPGE 54 DIEEC 0.688
    14 RLRAL 54 DIEEC 0.678
    21 RYLTE 42 ATEMM 0.677
    14 RLRAL 38 RIRLA 0.668
    83 KGEEV 22 TLLMA 0.667
    22 YLTEQ 21 RTLLM 0.663
    4 YKILV 42 ATEMM 0.658
    83 KGEEV 54 DIEEC 0.657
    22 YLTEQ 18 ADDRT 0.650
    22 YLTEQ 54 DIEEC 0.646
    18 LLERY 86 --AES 0.644
    15 LRALL 54 DIEEC 0.643
    18 LLERY 38 RIRLA 0.636
    22 YLTEQ 45 MMSAE 0.633
    22 YLTEQ 86 --AES 0.631


    The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.

[UPDATE: The heatmap looks pretty boring, so I'm going to skip it. But I plotted the top 20 interactions (graphic at the top of the post); the repetition indicates we're on the right track.]


import sys
from utils import load_data
import matplotlib.pyplot as plt

data = load_data('results.2.txt')
data = data.strip().split('\n')
data = [e.split() for e in data]
data = [(int(t[0]), int(t[1]), float(t[2])) for t in data]

def f(t): return float(t[2])

def part1():
    L = [t for t in data if t[0] == t[1]]
    L = sorted(L, key=f, reverse=True)
    for t in L[:10]:  print str(t[0]+1).rjust(3), round(t[2],3)
    print
    for t in L[-10:]: print str(t[0]+1).rjust(3), round(t[2],3)
    sys.exit()

#part1()

L = [t[2] for t in data if t[0] != t[1]]
X = 1.0
plt.hist(L, bins=int(X*50))
ax = plt.axes()
ax.set_xlim(0, X)
plt.savefig('example.png')

# t[0] always > t[1]
N = 190
data = [t for t in data if t[0] >= N and t[1] < N]

aln = load_data('cell3925mmc4.fa')
aln = aln.strip().split('>')[1:]
aln = [e for e in aln if e.startswith('26250004-26250005/1-457')]
envZ_ompR = aln[0].split('\n')[1]

for t in sorted(data, key=f, reverse=True)[:20]:
    i = t[0]
    rr = i - 187
    j = t[1]
    hk = j + 15
    print str(rr).rjust(3), envZ_ompR[i:i+5],
    print str(hk).rjust(3), envZ_ompR[j:j+5],
    print '%3.3f' % t[2]

    Mutual information (2)

    We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). The first post is here.

    In this part we'll load the alignment (supplementary data file S4---the annotation on the page is incorrect), and crunch the numbers. I just write the results to disk.

    We'll do the analysis in another post.

    python info.py > results.txt


import sys
from utils import load_data
import info_helper as ih

#aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'
fn = 'cell3925mmc4.fa'
data = load_data(fn)
data = data.strip().split('>')[1:]
data = [e.split('\n')[1].strip() for e in data]

def show(data):
    print 'starting:', len(data)
    for i in range(7):
        print i,
        L = [e for e in data if e.count('-') <= i]
        print len(L)
    sys.exit()

#show(data)

def transpose(L):
    R = range(len(L[0]))
    rL = list()
    for i in R:
        rL.append(''.join([item[i] for item in L]))
    return rL

data = [e for e in data if e.count('-') <= 4]
#data = data[:100]
cols = transpose(data)
pD = ih.make_prob_dict(cols)
info = dict()

for i in range(len(cols)):
    for j in range(i+1):
        info[(i,j)] = ih.get_info(i, j, cols, pD, v=False)

for i,j in sorted(info.keys()):
    print i, j, round(info[(i,j)], 3)

    Thursday, March 17, 2011

    Fun with geometry (1)



I found a couple of fun books of problems in geometry, algebra and probability (geometry book here).

    This is one of the problems: given the red circles with radius one-half the large black circle, and the blue circle inscribed so as to just fit inside, derive a relation between the radius of the blue circle and the others. This had me scratching my head for a few minutes before the aha moment.

    The challenge question is perhaps easier: prove that the filled-in gray area is equal to the area of one of the red circles.

    And a hint for the first problem comes from the next graphic, where I've made a copy of the blue circle and positioned it strategically:


    Tuesday, March 15, 2011

    Mutual information

    I want to talk about a really nice paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). It'll give us an opportunity to exercise our matplotlib skills.

    We're going to try to recreate Fig 1, which is visible in the PubMed page, or you can get the original paper from the link to Cell.

    Two-component signal transduction systems are ubiquitous in bacteria (wikipedia). The canonical design consists of a membrane-bound sensor (histidine) kinase (HK) and a cytoplasmic response regulator (RR). E. coli contains about 30 such pairs. The members of each pair have substantial specificity. The HK of the ntr system has specificity for its own RR, and likewise in the phoBR system, phoB has specificity for phoR. We may speak of a HK and its cognate RR.

    For our purposes the important thing is that each system comprises (in the simplest design) two protein partners with complementary surfaces. These systems (a pair of proteins) are the products of ancient gene duplication events, and have since diverged over time. Amino acids at interacting sites are constrained to co-evolve in each pair.

    If this sounds too vague or too complicated, consider an even simpler example: a stem of paired RNA residues in rRNA named H15.

    Here is the H15 sequence in 1D (the parentheses indicate residues involved in pairing---see the link above for details):


    ((((   (((((    ))))) ))))
    TGCACAATGGGCGCAAGCCTGATGCA


    And here is the inner stem drawn in 2D to show the base-pairing more directly:


    TGGGC
    GTCCG


    The base-pairing of this stem is more important to rRNA function than the identities of the bases. The result is that in some bacteria the identities of the central bases have been switched:


    original   co-evolved

    TGGGC TGCGC
    GTCCG GTGCG


    Presumably this happened in 2 discrete steps, but I don't know of any examples where the intermediate state has been preserved. Maybe we should look for some, and it's undoubtedly been studied.
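The pairing claim is easy to check in code. This sketch uses the DNA alphabet as written above, and counts GT (GU in RNA) wobble as a pair:

```python
# Watson-Crick pairs plus the GT (GU in RNA) wobble pair.
pairs = {('A','T'), ('T','A'), ('G','C'), ('C','G'),
         ('G','T'), ('T','G')}

def is_stem(top, bottom):
    # The two strands are written so that bases pair column by column.
    return all((x, y) in pairs for x, y in zip(top, bottom))

print(is_stem('TGGGC', 'GTCCG'))   # original stem: True
print(is_stem('TGCGC', 'GTGCG'))   # co-evolved stem: True
print(is_stem('TGGGC', 'GTGCG'))   # a hypothetical intermediate: False
```

The third call shows why the intermediate is interesting: swapping only one strand's central base breaks the pair.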

    To quantify this kind of coevolution, we'll draw on a concept (and mathematical definition) called mutual information. The steps in the calculation will be:

  • make a multiple sequence alignment
  • compare column X and column Y
  • total number of sequences (length of each column) = c

  • for each residue x in column X calculate px
  • for each residue y in column Y calculate py
  • px is the probability of residue x in column X

    We'll write the columns horizontally to save space.
    Suppose column X and Y are:


X: AASSASSTTT
Y: NMWWNTTKKS


    For column X we have:
    pA = 0.3 (3 A out of a total of 10 residues)
    pS = 0.4
    pT = 0.3

    For column Y:

    pK = 0.2
    pM = 0.1
    pN = 0.2
    pS = 0.1
    pT = 0.2
    pW = 0.2

    We pre-calculate these values for each column. When we calculate the information, we'll refer to the probabilities for column Y as q rather than p, to keep them straight from the p's for column X.

    Now, we consider each pair of residues, one from column X and one from column Y. This pair is made up of residues in two interacting protein surfaces or rRNA chains, that may have co-evolved.

  • pxy is the number of sequences with x in column X and y in column Y
  • divided by c, the total number of pairs:


X: AASSASSTTT
Y: NMWWNTTKKS


    pAM = 0.1
    pAN = 0.2
    pST = 0.2
    pSW = 0.2
    pTK = 0.2
    pTS = 0.1

    Finally, to compute the mutual information for this pair of columns, we do this calculation for each individual pair of residues and then sum:

pAM * log (pAM / (pA * qM)) = 0.1 * log (0.1 / (0.3 * 0.1)) = 0.0523
    I would have used log2, but Skerker et al used log10, so I matched them.
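Summing that term over all six residue pairs gives the mutual information for the example columns, which we can check directly (a standalone re-computation, separate from the script below):

```python
from math import log10

X = 'AASSASSTTT'
Y = 'NMWWNTTKKS'
c = len(X)

# Per-column probabilities: p for column X, q for column Y.
p = {ch: X.count(ch) / float(c) for ch in set(X)}
q = {ch: Y.count(ch) / float(c) for ch in set(Y)}

# Joint probabilities over the residue pairs, then the MI sum.
pairs = [x + y for x, y in zip(X, Y)]
info = 0.0
for pair in sorted(set(pairs)):
    pXY = pairs.count(pair) / float(c)
    info += pXY * log10(pXY / (p[pair[0]] * q[pair[1]]))

print(round(info, 2))   # 0.47, matching the script's output
```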

    Here is part of the output of the script below:


    NMWWNTTKKS
    AASSASSTTT
    pKT 0.2 pK 0.2 qT 0.3 temp 3.33 final 0.1
    pMA 0.1 pM 0.1 qA 0.3 temp 3.33 final 0.05
    pNA 0.2 pN 0.2 qA 0.3 temp 3.33 final 0.1
    pST 0.1 pS 0.1 qT 0.3 temp 3.33 final 0.05
    pTS 0.2 pT 0.2 qS 0.4 temp 2.50 final 0.08
    pWS 0.2 pW 0.2 qS 0.4 temp 2.50 final 0.08
    info 0.47


    temp is the result of the calculation inside the parentheses, above. Next time we'll apply this method to the data from Skerker.


from math import log
def log2(n):  return log(n)*1.0/log(2)
def log10(n): return log(n)*1.0/log(10)

# cache character probabilities for each column
def make_prob_dict(cols):
    # input is a list of columns
    # c is the number of sequences in the alignment
    c = len(cols[0])
    pD = list()
    for col in cols:
        char_kinds = list(set(col))
        values = [col.count(k)*1.0/c for k in char_kinds]
        pD.append(dict(zip(char_kinds, values)))
    return pD

def get_info(i, j, cols, pD, v=False):
    col1, col2 = cols[i], cols[j]
    if v: print col1 + '\n' + col2
    # as before, c is the number of sequences
    c = len(col1)
    assert c == len(col2)
    info = 0
    pairs = [col1[k] + col2[k] for k in range(c)]
    pL = sorted(list(set(pairs)))
    for p in pL:
        pXY = pairs.count(p)*1.0/c
        pX = pD[i][p[0]]
        pY = pD[j][p[1]]
        inside = (pXY * 1.0) / (pX * pY)
        if v: print 'p' + p, pXY,
        if v: print 'p' + p[0], pX,
        if v: print 'q' + p[1], pY,
        if v: print 'temp', '%3.2f' % round(inside, 2),
        outside = pXY * log10(inside)
        if v: print 'final', round(outside, 2)
        info += outside
    if v: print 'info', round(info, 2)
    return info

if __name__ == '__main__':
    aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'
    cols = aln.split('\n')
    pD = make_prob_dict(cols)
    info = dict()
    for i in range(len(cols)):
        for j in range(i):
            info[(i,j)] = get_info(i, j, cols, pD, v=True)
Sunday, March 13, 2011

    Cocoa: where to start?

    Recommended resources for beginning Cocoa with Objective-C

  • do the temperature converter in the Cocoa Application Tutorial
  • short articles at Cocoa Dev (here; much more here; C review)
  • Aaron Hillegass's book
  • Cocoa Fundamentals Guide (here)

    Specific to PyObjC:

  • Will Larson's tutorials (here here here here here)
  • Apple's page including a version of the temperature converter
  • Read the official introduction carefully (my biggest problem)

    A page of links to old material with simple demos of specific Cocoa features that mostly still work here.

    Code a simple game like TicTacToe, Fifteen, or Color Sudoku.

    After that, I've got tons of projects here and here. Get started with bindings (here), then move on to Vlad the Impaler (here).

    Learn specific topics by reading the Apple docs (slowly and repeatedly; it can take a while to get it) and by building a simple demo project that does only that one thing. Like NSPredicate, which we've done in six posts (here here here here here & one to come).

    And if you want the PyObjC templates for Xcode see here.
NSPredicate: PyObjC version

    Here is the answer in the previous post, converted to PyObjC. It's a bit simpler, except that some of the methods require a "real" NSArray or NSString.


from Foundation import *
import objc

class NSString(objc.Category(NSString)):
    def validate(self):
        A = self.UTF8String().split(':')
        fm = NSFileManager.defaultManager()
        def f(s):
            print 'test', s
            home = NSHomeDirectory()
            s = s.replace('~', home)
            return fm.fileExistsAtPath_(s)
        return all([f(s) for s in A])

s = NSString.stringWithString_('~/Desktop')
print s.validate()

for s in ['~:~/Desktop', 'xyz']:
    s = NSString.stringWithString_(s)
    f = NSExpression.expressionForConstantValue_(s)
    e = NSExpression.expressionForFunction_selectorName_arguments_(
        f, 'validate', None)
    results = e.expressionValueWithObject_context_(None, None)
    print results
    print

p = NSPredicate.predicateWithFormat_(
    "FUNCTION(SELF, 'validate') isEqual:YES")
A = ['~/Desktop', 'xyz']
A = [NSString.stringWithString_(s) for s in A]
A = NSArray.arrayWithArray_(A)
for item in A.filteredArrayUsingPredicate_(p):
    print item

print p.evaluateWithObject_(NSString.stringWithString_('~/Desktop'))


    output:


    test ~/Desktop
    True
    test ~
    test ~/Desktop
    True
    test xyz
    False

    test ~/Desktop
    test xyz
    ~/Desktop
    test ~/Desktop
    True

    NSPredicate: problem solved (almost)

    It only took me about four hours, but with big help from this post, I solved the problem of filtering a string containing an array of (possibly invalid) filepaths. And, in the meantime, Dave DeLong himself had responded to my question on StackOverflow from yesterday, which if I'd just waited a bit more, would have saved me some time. Thanks, Dave.

    As I mentioned (here), we're just wrapping a filtering routine up in a category on NSString. Here is the first part of the code file including the category:


#import <Foundation/Foundation.h>

@interface NSString (ValidatingFilepathArray)
- (NSNumber *) validate;
@end

@implementation NSString (ValidatingFilepathArray)

- (NSNumber *) validate {
    NSArray *A = [self componentsSeparatedByString:@":"];
    NSString *s, *p;
    NSFileManager *fm = [NSFileManager defaultManager];
    for (p in A) {
        s = [p stringByExpandingTildeInPath];
        if ([fm fileExistsAtPath:s]) {
            NSLog(@"passed: %@", p);
        }
        else {
            NSLog(@"failed: %@", p);
            return [NSNumber numberWithBool:NO];
        }
    }
    return [NSNumber numberWithBool:YES];
}
@end


    In the code at the bottom of the post, we construct a "function expression" from the new NSString method. Two things really confused me. First, in the method


    expressionForFunction:selectorName:arguments:


    the "function" is the object itself. And the second thing was figuring out the format string for the predicate that wraps up our expression:


    @"FUNCTION(SELF, 'validate') isEqual:YES"


    This is the output:


    > gcc -o test test.m -framework Foundation
    > ./test
    2011-03-13 16:53:42.974 test[3494:903] passed: ~
    2011-03-13 16:53:42.976 test[3494:903] passed: ~/Desktop
    2011-03-13 16:53:42.977 test[3494:903] expression for ~:~/Desktop: YES
    2011-03-13 16:53:42.978 test[3494:903] failed: xyz
    2011-03-13 16:53:42.978 test[3494:903] expression for xyz: NO
    -------------------------------------------------
    2011-03-13 16:53:42.979 test[3494:903] passed: ~
    2011-03-13 16:53:42.980 test[3494:903] passed: ~/Desktop
    2011-03-13 16:53:42.980 test[3494:903] failed: xyz
    2011-03-13 16:53:42.981 test[3494:903] filtered: ~:~/Desktop
    2011-03-13 16:53:42.981 test[3494:903] passed: ~/Desktop
    2011-03-13 16:53:42.982 test[3494:903] evaluate: YES


    First we construct the expression and evaluate it. In the second part, we construct the predicate and use it to filter an array or just "evaluate" a string. Still to do: test it in an app with bindings and an NSTextField or NSTableView.

    Code:


int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSString *validPathArray = @"~:~/Desktop";
    NSExpression *f = [NSExpression expressionForConstantValue:validPathArray];
    NSExpression *e = [NSExpression expressionForFunction:f
        selectorName:@"validate" arguments:nil];
    NSNumber *result = [e expressionValueWithObject:nil context:nil];
    NSArray *responses = [NSArray arrayWithObjects:@"NO", @"YES", nil];
    NSLog(@"expression for %@: %@", validPathArray,
        [responses objectAtIndex:[result intValue]]);

    NSString *invalidPathArray = @"xyz";
    f = [NSExpression expressionForConstantValue:invalidPathArray];
    e = [NSExpression expressionForFunction:f
        selectorName:@"validate" arguments:nil];
    result = [e expressionValueWithObject:nil context:nil];
    NSLog(@"expression for %@: %@", invalidPathArray,
        [responses objectAtIndex:[result intValue]]);

    printf("-------------------------------------------------\n");

    NSPredicate *p;
    p = [NSPredicate predicateWithFormat:@"FUNCTION(SELF, 'validate') isEqual:YES"];
    NSArray *A = [NSArray arrayWithObjects: validPathArray, invalidPathArray, nil];
    NSArray *fA = [A filteredArrayUsingPredicate:p];
    for (id obj in fA) {
        NSLog(@"filtered: %@", obj);
    }
    BOOL yesorno = [p evaluateWithObject:@"~/Desktop"];
    NSLog(@"evaluate: %@", [responses objectAtIndex:(int) yesorno]);
    [pool drain];
    return 0;
}

    NSExpression: simple examples

    The example from last time (here) introduced the NSExpression class. That one was rather complex in code, though what it does is just a simple filtering of values.

Here is another example which looks less forbidding: it combines two expressions in a comparison predicate that grabs the object for the key "value" from each dictionary in an array and checks it against the number 10.

    As before the code block is inside a standard main:


#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    // code here..
    [pool drain];
    return 0;
}


    and compiled like this:


    > gcc -o test test.m -framework Foundation


    code:


NSExpression *lhs = [NSExpression expressionForKeyPath:@"value"];
NSNumber *ten = [NSNumber numberWithInt:10];
NSExpression *rhs = [NSExpression expressionForConstantValue:ten];

NSPredicate *p = [NSComparisonPredicate
    predicateWithLeftExpression:lhs
    rightExpression:rhs
    modifier:NSDirectPredicateModifier
    type:NSGreaterThanOrEqualToPredicateOperatorType
    options:0];

NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:3] forKey:@"value"],
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:15] forKey:@"value"], nil];
NSArray *fA = [A filteredArrayUsingPredicate:p];
for (id obj in fA) {
    NSLog(@"%@", [obj description]);
}



    > ./test
    2011-03-13 10:58:53.885 test[218:903] {
    value = 15;
    }


    Here is a second example, taken from the docs, of how to construct an expression that uses a built-in function (they call this a function expression):


NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:6],
    nil];
NSExpression *eA = [NSExpression
    expressionForConstantValue:A];
NSArray *args = [NSArray arrayWithObject:eA];

NSExpression *e = [NSExpression
    expressionForFunction:@"average:" arguments:args];
id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);
float f = [result floatValue];
printf("result = %3.2f\n", f);



    2011-03-13 11:01:00.337 test[236:903] 4.5 NSCFNumber
    result = 4.50


    There are lots of built-in functions available (here).

The third example is based on the first part (the simple part) of this post from Dave DeLong. It defines a category on NSNumber:


@interface NSNumber (FactorialExpression)
- (NSNumber *) factorial;
@end

@implementation NSNumber (FactorialExpression)
- (NSNumber *) factorial {
    // tgamma comes from math.h
    double baseValue = [self doubleValue];
    double result = tgamma(baseValue + 1);
    return [NSNumber numberWithDouble:result];
}
@end


    and the code block is:


NSNumber *n = [NSNumber numberWithDouble:4.2];
NSLog(@"%@ %@", n, [n factorial]);
NSLog(@"%p %d", n, [n respondsToSelector:@selector(factorial)]);

NSExpression *f = [NSExpression expressionForConstantValue:n];
NSExpression *e = [NSExpression expressionForFunction:f
    selectorName:@"factorial"
    arguments:nil];
NSLog(@"operand %@ %@", [e operand], [[e operand] class]);
NSLog(@"operand %@", [e function]);

id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);



    > ./test
    2011-03-13 11:02:46.798 test[251:903] 4.2 32.57809605033135
    2011-03-13 11:02:46.800 test[251:903] 0x100108d20 1
    2011-03-13 11:02:46.801 test[251:903] operand 4.2 NSConstantValueExpression
    2011-03-13 11:02:46.801 test[251:903] operand factorial
    2011-03-13 11:02:46.802 test[251:903] 32.57809605033135 NSCFNumber


    And now, the way to solve my initial question seems clear:

    Define a category on NSString that does what I want.
    Wrap the call up in an NSExpression and then an NSPredicate. Next time, if I succeed.

    Saturday, March 12, 2011

    NSPredicate: compound Cocoa example in code

    I'm resisting the temptation to call this simple---it's not! All we're doing is filtering an array of dicts for those whose 'value' is neither too large nor too small. I have no idea why it has to be this complicated. The example is from the NSPredicate docs (as reformatted and with renamed vars by me):


#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

    NSArray *A = [NSArray arrayWithObjects:
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:5] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:50] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:500] forKey:@"value"],
        nil];

    NSExpression *lhs = [NSExpression
        expressionForKeyPath:@"value"];
    NSExpression *gtrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:10]];
    NSExpression *ltrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:100]];

    NSPredicate *gtpred;
    gtpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:gtrhs
        modifier:NSDirectPredicateModifier
        type:NSGreaterThanOrEqualToPredicateOperatorType
        options:0];

    NSPredicate *ltpred;
    ltpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:ltrhs
        modifier:NSDirectPredicateModifier
        type:NSLessThanOrEqualToPredicateOperatorType
        options:0];

    NSPredicate *pred;
    pred = [NSCompoundPredicate andPredicateWithSubpredicates:
        [NSArray arrayWithObjects:gtpred, ltpred, nil]];

    NSArray *fA = [A filteredArrayUsingPredicate:pred];
    for (id obj in fA) {
        NSLog(@"%@", [obj description]);
    }
    [pool drain];
    return 0;
}

    > gcc -o test pred.m -framework Foundation
    > ./test
    2011-03-12 11:48:31.640 test[70403:903] {
    value = 50;
    }
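
    For contrast, the same filter in plain Python (a sketch, not part of the post's Cocoa code) is nearly a one-liner:

```python
# Keep the dicts whose 'value' lies in [10, 100], just as the
# compound predicate above does.
A = [{'value': 5}, {'value': 50}, {'value': 500}]
fA = [d for d in A if 10 <= d['value'] <= 100]
print(fA)   # [{'value': 50}]
```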

    NSPredicate: simple Cocoa examples

    I'm exploring predicates and expressions in Cocoa using Objective-C. Actually, I'd like to use PyObjC, but this is tricky enough that I think it's better to start with Objective-C.

    Ultimately I'd like to write complex predicates, for example, to filter using either a custom function or a block. The reason I need this is to validate edits to an NSTableView using bindings. According to the docs:


    A predicate is a logical operator that returns a Boolean value (true or false). There are two types of predicate: comparison predicates and compound predicates:

    ● A comparison predicate compares two expressions using an operator. The expressions are referred to as the left hand side and the right hand side of the predicate (with the operator in the middle). A comparison predicate returns the result of invoking the operator with the results of evaluating the expressions.

    ● A compound predicate compares the results of evaluating two or more other predicates, or negates another predicate.


    All the code is like this:


    #import <Foundation/Foundation.h>
    
    int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    .. specific code here ..
    [pool drain];
    return 0;
    }


    compiled and run like this:


    > gcc -o test pred.m -framework Foundation
    > ./test

    Example 1. Predicate with a format string using "like"


    (the [c] makes the comparison case-insensitive; the * is a wildcard):

    NSArray *A = [NSArray arrayWithObjects:
    @"John",@"Paul",@"George",@"Ringo",nil];
    NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF like [c] 'g*' "];    
    NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);
    
    > ./test
    2011-03-12 09:38:55.379 test[67230:903] George
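
    For reference, the same case-insensitive wildcard match in plain Python (a sketch using the standard fnmatch module, not PyObjC):

```python
import fnmatch

# 'g*' with case folded, like the [c] option in the predicate format.
A = ['John', 'Paul', 'George', 'Ringo']
matches = [s for s in A if fnmatch.fnmatch(s.lower(), 'g*')]
print(matches)   # ['George']
```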

    Example 2. Format string with a mathematical operator



    NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:5],nil];
    NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF > 4 "];    
    NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);
    
    > ./test
    2011-03-12 09:38:22.313 test[67217:903] 5


    Note: this also works with the usual kind of format-string substitution: @"SELF > %d ", 4

    Example 3: Format string using a key path



    NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"], nil];
    
    NSPredicate *p = [NSPredicate predicateWithFormat:@"name like %@", @"x"];    
    NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);
    
    > ./test
    2011-03-12 10:01:43.963 test[67885:903] {
    name = x;
    }

    Example 4: Format string substituting a dynamic key path



    NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"], nil];
    //NSLog(@"%@", [[A objectAtIndex:0] description]);
    
    NSString *name = @"name";
    NSString *value = @"x";
    NSPredicate *p = [NSPredicate predicateWithFormat:@"%K like %@", name, value];    
    NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);
    
    > ./test
    2011-03-12 09:29:59.220 test[67013:903] {
    name = x;
    }

    Example 5. Compound predicate



    NSArray *theKeys = [NSArray arrayWithObjects:@"name",@"value",nil];
    NSArray *o1 = [NSArray arrayWithObjects:@"x",[NSNumber numberWithInt:5],nil];
    NSArray *o2 = [NSArray arrayWithObjects:@"y",[NSNumber numberWithInt:7],nil];
    NSArray *o3 = [NSArray arrayWithObjects:@"z",[NSNumber numberWithInt:9],nil];

    NSArray *A = [NSArray arrayWithObjects:
        [NSMutableDictionary dictionaryWithObjects:o1 forKeys:theKeys],
        [NSMutableDictionary dictionaryWithObjects:o2 forKeys:theKeys], nil];
    NSLog(@"%d", [A count]);

    NSPredicate *p = [NSPredicate predicateWithFormat:
        @"(value < %@) OR (name == %@)", [NSNumber numberWithInt:6], @"y"];
    NSArray *fA = [A filteredArrayUsingPredicate:p];
    NSLog(@"%d", [fA count]);
    for (id obj in fA) {
        NSLog(@"%@", [obj description]);
    }
    
    > ./test
    2011-03-12 10:34:11.921 test[68778:903] 2
    2011-03-12 10:34:11.924 test[68778:903] 2
    2011-03-12 10:34:11.925 test[68778:903] {
    name = x;
    value = 5;
    }
    2011-03-12 10:34:11.925 test[68778:903] {
    name = y;
    value = 7;
    }

    Dental project (6)

    I want to show some more results from this project, namely, the UniFrac analysis. What I did for the paper was to cluster very closely related sequences (alignment > 450 and 0 or 1 mismatches), then upload them to RDP, which aligns the sequences as they are uploaded. The phylogenetic tree needs to be rooted, and I decided to use Thermotoga SL7 for this (Genbank AJ401017).

    Rather than deal with the clustered OTUs for this post, I just uploaded all 1120 sequences, and carried out the analysis. The first time through (today) I forgot to include the outgroup! So that gives us a chance to see how much difference it makes.

  • Working in directory: temp
  • Check that seqs.fna from dental project dir has 1120 seqs
  • Rename to dental_1120.fna
  • Upload to RDP
  • Download as dental_1120_rdp.fna

  • Use R/ape to make a tree


    setwd('Desktop/temp/rdp')
    library(ape)
    dna = read.dna('dental_1120_rdp.fna',format='fasta')
    tr = nj(dist.dna(dna))
    plot(tr)
    write.tree(tr,'tree.txt')


  • Write a simple script to make the environment file


    python write_env.py > environ.txt


    It looks like this:


    DAA_44  D_DAA
    DAA_43 D_DAA
    DQ_209 D_DQ
    ..
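
    write_env.py itself isn't shown, but the idea is simple. Here is a minimal sketch (the env_line helper and the inlined titles are hypothetical; output is tab-separated):

```python
# Map each sequence title to its environment label: the part of the
# title before the underscore, prefixed with 'D_'.
def env_line(title):
    sample = title.split('_')[0]
    return title + '\t' + 'D_' + sample

for t in ['DAA_44', 'DAA_43', 'DQ_209']:
    print(env_line(t))
```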


  • Upload tree and environment file to UniFrac
  • PCA (unweighted)
  • View as data table



  • Download data to pca_web.txt
  • Run the script below to plot the data using matplotlib.

    Here it is:




    From the UniFrac FAQ:


    My tree was not rooted, but I was able to upload my file and perform an analysis. Are the results valid?

    There is no way to tell based on a Newick string alone whether a tree is rooted or not. If an unrooted tree is input, UniFrac will usually assign an arbitrary root and allow you to perform the analysis on that tree. How the tree is rooted can affect the results of both UniFrac tests and the P test. You should redo the analysis with a tree that is rooted with an appropriate outgroup.


    It turns out to be easy enough: go back to RDP and browse to find SL7, then add it to the sequence cart. Repeat the download to dental_1120+_rdp.fna. Load the last 5 sequences into clustalx.app and check that SL7 is really properly aligned.




    setwd('Desktop/temp/rdp')
    library(ape)
    dna = read.dna('dental_1120+_rdp.fna',format='fasta')
    tr = nj(dist.dna(dna))

    > tr

    Phylogenetic tree with 1121 tips and 1119 internal nodes.

    Tip labels:
    DAA_44, DAA_43, DQ_209, DAA_45, DAA_40, DC_81, ...

    Unrooted; includes branch lengths.

    > grep('SL', tr$tip.label)
    [1] 1121


  • Root the tree appropriately and write it to disk


    tr2=root(tr,1121)
    plot(tr2)
    write.tree(tr2,'rooted_tree.txt')


  • Go back to UniFrac



    Repeat the PCA. You can look at the data in a spreadsheet app:



    Now I plot it in matplotlib. The first image is what I plotted today for the rooted tree; the second is from the paper. Looks pretty good to me. Also, note some minor differences from the previous graphic, where the tree we used was unrooted (so UniFrac assigned an arbitrary root for us).





    plot_web.py


    import sys
    import matplotlib.pyplot as plt
    from fileUtilities import load_data

    d = 0.5

    fn = 'pca_web.txt'
    data = load_data(fn)
    data = data.strip().split('\n\n')[0]
    data = data.strip().split('\n')[1:]
    L = list()

    for e in data:
        name, x, y = e.split()[:3]
        x,y = float(x), float(y)
        x *= -1
        name = name[2:]
        if name[1] in 'BCM': c = 'blue'
        else: c = 'red'
        plt.scatter(x,y,s=100,color=c)
        if name == 'DG':
            y += 0.03
        plt.text(x+0.03,y-0.02,va='center',
            s=name[1:],color=c,fontsize=16)

    plt.plot((-d,d),(0,0),':',zorder=0)
    plt.plot((0,0),(-d,d),':',zorder=0)

    ax = plt.axes()
    ax.set_xlim(-d,d)
    ax.set_ylim(-d,d)
    plt.savefig('pca_web.png')
    Dental project (5)



    This post is one of a series (see dental project here or in the sidebar).

    Last time I said I would show you how I make heatmaps these days. I've approached it several different ways over the past few years (R, Cocoa, matplotlib), but I think now that matplotlib is best, at least for me. Ultimately what I want is flexibility, and if you're a Python coder and you have matplotlib installed (as we've also discussed many times), then you'll have that. But I don't want to get into the technical details---and actually the script is a bit long, so I just put it (Heatmapper.py and its helper Preprocessor.py) into the zipped project files on Dropbox (here). The output from two different modes is at the bottom of the post. You just need a file data.csv in the same directory. It looks a little fuzzy and not as clean as I would like, but that's because there are so many samples, and partly because of the italic font. If you do savefig to a pdf file, and then blow it up, it looks great.

    In this post I want to talk in a general way about the project and what I think it means. It began about four years ago, when we became aware that some folks in Dentistry at our school (WVU) were involved in a huge study of people from Appalachia (it's called COHRA). Poor oral health is a particular problem in West Virginia, and this study had collected thousands of samples along with patient histories and lots of clinical data. My belief is that the important thing about these samples is that the patients are young yet have serious periodontal issues. In any event, we convinced the people who actually run the project (based elsewhere) to let us have (a small part of) 8 samples out of all their thousands sitting in the freezers down the hall.

    We did PCR with "universal" primers for the bacterial 16S rRNA gene, and cloned and sequenced the numbers you see in the table. It's not a big study (we don't have much money anyway), but we saw something which I think is truly significant. In high disease individuals, a broad group of microbes from the Clostridiales including an unusual clade called the Veillonellaceae are increased in abundance, whereas the sequences from control individuals in this clade were all very closely related to Veillonella parvula.

    One reason this observation may be important is that the so-called "red complex", which is thought to be associated with serious periodontal disease, can only be recovered in about half the individuals with this diagnosis (not even considering abundance).

    That story is in the modified version of the map above, where I drew a red box around the region of interest for the three controls, or "low disease" samples. Time went on, another set of samples from a different group was added to the study, and we were able to get the work published. So that's why the study looks so old-fashioned: in an era of millions of reads, we've got about a thousand.

    My role in all this was to actually do the analysis. I remember "we" wrote a grant (actually, someone else did!) and listed me as a technical expert in bioinformatics. Of course, the reviews were scathing. Dr. E doesn't have a degree in bioinformatics. How could he know anything?

    Well, I've learned a few things over the years. Rule one is, never make your own database: let someone else do it. That's why HOMD (and Greengenes and RDP) are so great. I particularly like the tools at the RDP site. It is very nice software.

    And rule two is, if you live long enough, you will see work that took you months or years to accomplish be achieved using new tools in mere seconds or hours. Sequencing is a great example of this. When I was young I spent most of three months getting 500 bp; when I was a bit older I invested six months for 3.5 kb; still later it was a year for 20 kb.

    This project is another example. I spent a year and more writing some 50 or so Python scripts (and rewriting them), and now QIIME does the whole thing in mere seconds.

    Well, not quite the whole thing. I have a bit more to do with this project. I want to show you the UniFrac analysis of beta diversity, and show how to make what I think is a nicer plot of the PCoA results. Also, I want to show some phylogenetic trees detailing the increased diversity (species richness, really) in the Veillonellaceae that I mentioned.

    And I should say: it's been fun. Even if I don't have that degree, or any papers with Rob Knight, I think I've learned something about Bioinformatics in the last 5 years.




    Thursday, March 10, 2011

    Flag update

    Dental project (4)



    This post is one of a series (see dental project here or in the sidebar).

    After getting a set of sequences and removing chimeras, the next step is almost anticlimactic. We just copy a modified version of the shell script (from here, without the cd calls) or paste in the commands working from the dental directory (either individually, or all at once):


    #!/bin/bash

    pick_otus.py -i seqs.fna -m uclust -s 0.97 -o otus
    pick_rep_set.py -f seqs.fna -i otus/seqs_otus.txt -m most_abundant -o otus/reps.txt
    align_seqs.py -i otus/reps.txt -m pynast -t ~/data/core.txt -o aln
    assign_taxonomy.py -i otus/reps.txt -m rdp -o tax
    filter_alignment.py -i aln/reps_aligned.txt -m ~/data/mask.txt -o aln2
    make_phylogeny.py -i aln2/reps_aligned_pfiltered.fasta -o figs/tree.tre
    make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt
    summarize_taxa.py -i figs/otu_table.txt -o figs/otu_table_Level3.txt -L 3
    plot_taxa_summary.py -i figs/otu_table_Level3.txt -l Phylum -o figs -k white
    make_otu_heatmap_html.py -i figs/otu_table.txt -o figs


    It's all over in a few seconds.

    The heatmap QIIME produced is at the top of the post. It is truly a remarkable HTML page, with a graphic where you can reorder the columns or rows by drag and drop, redo the map at different thresholds for the OTUs, etc. I've never seen anything quite like it. But (and this is just me), it's not pretty enough.

    So what I'd like to do from here is to show you how I currently make heatmaps with matplotlib, and we'll get into that next time.

    First, I have to extract the data from QIIME. The script is complicated a bit by an additional job: I'm going to organize the rows and columns. (QIIME can do this too---see the tutorial).

    The columns will be in the order they appear in sample_names.txt and the rows as they appear in genera_and_colors.txt. These files are in the same directory. The second one starts like this:


    # Bacteria black
    Bacteria
    # Bacteroidetes green
    Bacteroidetes
    Bacteroidales
    Rikenella
    Prevotella
    Porphyromonas
    Saprospiraceae
    Sphingobacteriales
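
    The '#' comment lines carry the group colors; recovering just the row order from a file like this is a one-liner (a sketch, with the file contents inlined):

```python
# Keep only the taxon lines; lines starting with '#' hold group colors.
text = """# Bacteria black
Bacteria
# Bacteroidetes green
Bacteroidetes
Bacteroidales"""

rows = [line for line in text.splitlines() if not line.startswith('#')]
print(rows)   # ['Bacteria', 'Bacteroidetes', 'Bacteroidales']
```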


    I just do this:


    > python grab_qiime_data.py > data.csv


    Here's the first part of data.csv:


    ,DB,DC,DM,DF,DL,DG,DI,DA,DT,DQ,DV,DAA,DZ
    Bacteria,,,,2,,,,,,,,,12
    Bacteroidetes,,,,,,2,,,,,,,
    Bacteroidales,,,,,,1,,,,,,,
    Prevotella,,,,,3,2,,1,,,,,
    Porphyromonas,,,,,,1,,1,,,,,
    Sphingobacteriales,,,,,,,,,,1,,,
    Capnocytophaga,9,,6,2,3,6,1,19,31,35,2,8,7
    Fusobacterium,1,,,4,,2,1,2,,2,1,3,1


    The leading comma on line 1 is so the column headers line up properly in a spreadsheet. Speaking of spreadsheets, here is a screenshot after dropping the data.csv file onto Numbers (you could use Excel, of course):



    That was painless! Zipped project files in Dropbox (here).
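
    If you want the counts back in Python rather than a spreadsheet, the empty cells can be read as zeros (a sketch with a trimmed-down data.csv inlined):

```python
import csv, io

# Two header columns, two taxa; empty cells mean a count of zero.
text = ",DB,DC\nBacteria,,2\nPrevotella,3,"
rows = list(csv.reader(io.StringIO(text)))
header = rows[0][1:]
counts = dict((r[0], [int(x) if x else 0 for x in r[1:]])
              for r in rows[1:])
print(header, counts)
```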

    Wednesday, March 9, 2011

    Dental project (3)



    wikimedia

    Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.

    And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.

    Make a directory temp with a copy of seqs.fna. The sequences first need to be converted to NAST format, then we can run ChimeraSlayer.pl.


    prog1=~/Software/microbiomeutil_2010-11-02/NAST-iEr/run_NAST-iEr.pl
    $prog1 --query_FASTA seqs.fna > seqs.nast

    prog2=~/Software/microbiomeutil_2010-11-02/ChimeraSlayer/ChimeraSlayer.pl
    $prog2 --query_NAST seqs.nast


    It takes the better part of an hour on my slowest machine (a 5 year old iMac).

    seqs.nast.CPS.CPC.wTaxons has flagged four sequences:


    DA228 INTRA-GENUS
    DQ822 INTRA-PHYLUM
    DV55 INTRA-FAMILY
    DAA89 INTRA-FAMILY


    I grab those four by hand into a new file suspects.fna (there is probably a better way) and do:


    $prog1 --query_FASTA suspects.fna > suspects.nast
    $prog2 --query_NAST suspects.nast --printCSalignments


    The output shows there is definitely a problem. In suspects.nast.CPS.CPC.wTaxons we have:


    ChimeraSlayer DQ_822 S000427388 S000260335 1.0566 98.74 100 0.7978 74.56 0 YES NAST:1861-1863 ECO:324-325 Streptococcus Streptococcus cristatus (T); NCTC12479; AB008313 Streptococcus cristatus Lachnospiraceae Incertae Sedis Clostridium aerotolerans (T); DSM 5434; X76163 Clostridium aerotolerans INTRA-PHYLUM
    Per_id parents: 73.80

    Per_id(Q,A): 93.45
    --------------------------------------------------- A: S000427388
    99.64 78.63
    ~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
    DivR: 1.057 BS: 100.00 |
    Per_id(QLA,QRB): 98.74 |
    |
    (L-AB: 72.50) | (R-AB: 76.92)
    WinL:0-279 | WinR:280-396
    |
    Per_id(QLB,QRA): 74.56 |
    DivR: 0.798 BS: 0.00 |
    ~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~ Q: DQ_822
    72.86 96.58
    ---------------------------------------------------- B: S000260335
    Per_id(Q,B): 79.85

    DeltaL: 26.79 DeltaR: -17.95

    !
    CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATGGTCAATGGGAAATCC A: S000427388
    CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATAGTCAATGGGAAATCC Q: DQ_822
    TCCGACTAAGATTCGGAATCGGGCACAAGATCTCGACAGGCAGCACAGTGGAACTCCGGT B: S000260335
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!


    ACCCTTGTCGTGACATC A: S000427388
    ACCCTTGTCGTGACATC Q: DQ_822
    GGTGCCTGGACCCAGCT B: S000260335
    !!!!!!!!!!!!!!!!!

    ** Breakpoint **

    !!!!!!!!!!!!! !!!!! !!!!!!!
    GACCGACGGTCGAGTGTATCGGGGTAAA A: S000427388
    AGAGAGGACCTCGGTTATATGACAGCGG Q: DQ_822
    AGAGAGGACCTCGAGTATATCACAGCAG B: S000260335
    !! ! !


    The first match is great for a while, then terrible, and the second is the converse.

    I need to look into whether I should update the Genbank records; I suspect the answer is yes.

    Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:


    >DQ822
    BLAST front(len = 207):
    320 207/207 100.00 Streptococcus_clone_BP2-57_AB121930.1
    319 205/207 99.03 Streptococcus_clone_502H08_AM420202.1
    323 203/207 98.07 Streptococcus_cristatus_AB008313.1
    321 200/207 96.62 Streptococcus_clone_BW009_AY005042.1
    334 199/209 95.22 Streptococcus_sanguinis_SK36_SK36
    BLAST back(len = 187):
    349 179/187 95.72 Uncultured_clone_4.59_DQ346409.1
    79 179/187 95.72 Clostridiales_clone_301C11_AM420062.1
    121 178/187 95.19 Eubacterium_clone_DO008_AF385508.1
    117 178/187 95.19 Eubacterium_clone_BP2-88_AB121960.1
    115 178/187 95.19 Eubacterium_clone_BL026B96_AY806377.1

    >DV55
    BLAST front(len = 210):
    353 207/210 98.57 Uncultured_clone_E105_DQ326659.1
    100 206/210 98.10 Dialister_sp._E2_20_AF481209.1
    94 205/210 97.62 Dialister_invisus_AY162469.1
    31 190/207 91.79 Allisonella_clone_BL34_DQ130020.1
    93 192/210 91.43 Dialister_clone_MCE7_134_AF481210.1
    BLAST back(len = 190):
    373 190/190 100.00 Veillonella_parvula_X84005.1
    372 190/190 100.00 Veillonella_clone_X042_AF287781.1
    370 190/190 100.00 Veillonella_clone_BU083_AF366266.1
    369 190/190 100.00 Veillonella_clone_AA050_AF287782.1
    371 182/183 99.45 Veillonella_clone_R1_DQ123569.1

    >DAA89
    BLAST front(len = 210):
    283 209/210 99.52 Selenomonas_clone_CI002_AF287798.1
    286 202/210 96.19 Selenomonas_clone_EQ054_AF385495.1
    288 201/210 95.71 Selenomonas_clone_FT050_AY349403.1
    298 189/195 96.92 Selenomonas_noxia_AF287799.1
    297 200/210 95.24 Selenomonas_infelix_AF287802.1
    BLAST back(len = 190):
    373 184/187 98.40 Veillonella_parvula_X84005.1
    370 184/187 98.40 Veillonella_clone_BU083_AF366266.1
    372 181/184 98.37 Veillonella_clone_X042_AF287781.1
    369 181/184 98.37 Veillonella_clone_AA050_AF287782.1
    371 182/187 97.33 Veillonella_clone_R1_DQ123569.1


    Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.

    It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.

    Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.
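
    A trivial automated flag would have caught these: compare the first token of the top front-half hit (roughly, the genus) with that of the top back-half hit. A sketch (the suspect helper is hypothetical; the titles come from the output above):

```python
# Flag a sequence as a chimera suspect when its two halves
# hit differently-named groups in the boutique database.
def suspect(front_hit, back_hit):
    genus = lambda title: title.split('_')[0]
    return genus(front_hit) != genus(back_hit)

print(suspect('Streptococcus_clone_BP2-57_AB121930.1',
              'Clostridiales_clone_301C11_AM420062.1'))   # True
print(suspect('Veillonella_parvula_X84005.1',
              'Veillonella_clone_BU083_AF366266.1'))      # False
```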

    We'll remove these from our sequence file by hand:


    DA_228
    DQ_822
    DV_55
    DAA_89


    Now there are 1120.

    David Broder

    I'm a political junkie. Can't help it. I start the day with Ezra Klein (after he appears online), and perhaps, end the day with Josh Marshall's gang. So it is with sadness that I hear of David Broder's passing (no link, the Post obit runs an ad).

    There was a time in the 90s when I watched Washington Week in Review (when it was something, with Ken Bode as the moderator) and Broder was always impressive. I think he was the last of the giants---think David Brinkley, Walter Cronkite, Charles Kuralt.

    Now we're left with screeching vapid idiots like---well, anyone on CNN except David Gergen. I miss the time when serious analysis could be found on TV. Sigh. Perhaps a good analogy is this: think of the difference between Hubie Brown (or Doug Collins) and anyone else who calls basketball games. I love Hubie, he's entertaining, but even more important, you can actually learn something.

    Dental project (2)

    This is a series, first post here. Before we do anything else, we need to clean up the titles on the sequences. They come from Genbank like this:


    >gi|324104022|gb|HQ894465.1| Uncultured bacterium clone DA19 16S ribosomal RNA gene, partial sequence


    We want this:


    >DA_19
    .

    We could do something like a regular expression, but that's overkill. Note that the alpha part is variable length, so we have to be a little bit smart. But these are so regular, it's easy. Also, UniFrac wants an underscore, so we add that.

    Just do this from the command line:


    > python retitle.py > seqs.txt



    If you count the sequences, you should have 1124.
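
    Counting FASTA records is just counting title lines (a quick sanity-check sketch, with a two-record example inlined):

```python
# Each FASTA record starts with '>' at the beginning of a line.
fasta = '>DA_19\nACGT\n>DA_20\nTTGA\n'
n = sum(1 for line in fasta.splitlines() if line.startswith('>'))
print(n)   # 2
```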

    retitle.py


    import utils as ut
    digits = '0123456789'

    data = ut.load_data('results.txt')
    data = data.strip().split('\n\n')
    for item in data:
        title,seq = ut.clean_fasta(item)
        e = title.split()[4]
        d = ''.join([c for c in e if c in digits])
        s = ''.join([c for c in e if not c in digits])
        print '>' + s + '_' + d
        print seq
        print

    Dental project (1)

    I'd like to spend a few posts talking about a project we just completed on surveying the bacterial species present in patients with gingivitis compared to normal controls. The main reason is to exercise QIIME with a different data set, but I'd also like to say a bit about the project.

    I checked this morning at PubMed and found the paper has come out:

    Olson 2011 PMID 21362199

    Click on the link to download; at 7.6 MB, that's just about the largest file size for a paper I've ever seen. Opening it up, I see why: it's still in manuscript form, and some of the figures are quite big. To play with this, we'll need to get the sequences. Luckily they were posted by Genbank just the other day.

    I wrote a script to grab the sequences in chunks of 40, with a timer to sleep for 10 seconds between requests. The first sign of trouble was here:


    HQ895465 HQ895504 1
    URL Error
    HQ895505 HQ895544 1
    ..


    but eventually, we did another request for this batch which looked like it worked:


    HQ895465 HQ895504 2


    but actually, the file contains this near the end:


    PubseqAccess cmd(id_gi_by_word HQ895505) failed with Cannot get server name from load balancer 'PUBSEQ_OS_PUBLIC' : errmsg='Service not found' HQ895505


    and then more of the same. Looking at the sequences, it seems they cut us off at 1000 sequences, stopping with HQ895464.1.

    I thought this should be OK. It's very early in the morning, with more than 3 seconds between requests, but apparently we ran up against some kind of limit.

    I give the server some time to calm down (and change the name of the file we've written to), edit the list, and try again:


    > python fetchSeqs.py 
    HQ895465 HQ895504 1
    HQ895505 HQ895544 1
    HQ895545 HQ895584 1
    HQ895585 HQ895588 1


    then combine by hand. Next time we'll take a look at them.

    fetchSeqs.py

    import urllib2, sys, time
    from utils import load_data

    ncbi = 'http://eutils.ncbi.nlm.nih.gov'
    eutils = ncbi + '/entrez/eutils/'
    efetch = 'efetch.fcgi?'

    def chunks(L,SZ):
        rL = list()
        while L:
            rL.append(L[:SZ])
            L = L[SZ:]
        return rL

    def fetch(L):
        s = eutils + efetch
        s += 'id=' + L
        s += '&db=nucleotide&rettype=fasta&retmode=text'
        try:
            FH = urllib2.urlopen(s)
            data = FH.read()
        except:
            raise ValueError('URL Error')
        if 'NCBI C++' in data:
            raise ValueError('Empty')
        elif not data:
            raise ValueError('Empty')
        return data

    def run(FH):
        first = 894465
        #first = 895465
        last = 895588
        L = ['HQ' + str(n) for n in range(first,last+1)]
        L = chunks(L,SZ=40)
        L = [(e,1) for e in L]
        while L:
            sL,n = L.pop(0)
            print sL[0], sL[-1], n
            # give up on a batch after three failed tries
            if n > 3: continue
            try:
                s = fetch(','.join(sL))
                FH.write(s)
            except ValueError as e:
                print e
                L.append((sL,n+1))
            time.sleep(10)

    if __name__ == '__main__':
        FH = open('results.txt','w')
        run(FH)
        FH.close()

    Tuesday, March 8, 2011

    16S rRNA V regions, continued



    As promised last time (here), this is the code to plot V regions from the sequences in the Enterobacteriales, obtained from RDP. If you compare to this post from the other day, you'll see we match pretty well.

    The figure looks better with a wider window, but the limits are established more accurately with a smaller window. The threshold T needs to be adjusted based on the window size.
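
    The shannon function comes from my info module (not shown in the post); presumably it is something like this sketch, computing the entropy in bits of one alignment column from a dict of base counts (which would explain the y-axis topping out near 2):

```python
import math

def shannon(cD, alphabet='ACGT'):
    # Shannon entropy (bits) of one alignment column,
    # given a dict of base counts like {'A': 5, 'C': 5, ...}.
    counts = [cD.get(c, 0) for c in alphabet]
    total = float(sum(counts))
    H = 0.0
    for n in counts:
        if n:
            p = n / total
            H -= p * math.log(p, 2)
    return H

print(shannon({'A': 5, 'C': 5, 'G': 5, 'T': 5}))   # 2.0 (maximally variable)
print(shannon({'A': 20, 'C': 0, 'G': 0, 'T': 0}))  # 0.0 (perfectly conserved)
```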

    One detail: I did the redirect below to save the extreme values and then parsed them with the code at the end of the post.


    python script.py > extreme_values.txt


    script.py

    import utils as ut
    from info import shannon
    import matplotlib.pyplot as plt

    data = ut.load_data('entero.txt')
    data = data.strip().split('>')[1:]
    EC = [e for e in data if 'X80725' in e][0]
    EC = ut.clean_fasta(EC)[1]
    data = [ut.clean_fasta(e)[1] for e in data]
    L = ut.make_count_list(data)

    pos = 0
    R = range(1,1451)
    iL = list()

    for i,c in enumerate(EC):
        if c == '-': continue
        pos += 1
        if not pos in R: continue
        cD = L[i]
        e = shannon(cD,'ACGT')
        iL.append(e)
        #print str(pos).ljust(3), c + ' ',
        #print ''.join([str(cD[k]).ljust(5) for k in 'ACGT']),
        #print round(e,2)

    aL = list()
    w = 20
    T = 1.8
    for i in range(len(iL)):
        j, k = i - w, i + w + 1
        if j < 0: j = 0
        if k > len(iL): k = len(iL)
        m = ut.mean(iL[j:k])
        if m < T: print i+1
        aL.append(m)

    plt.plot((1,1451),(T,T),lw=2,color='r',zorder=0)
    plt.scatter(R,aL)
    ax = plt.axes()
    ax.set_xlim(-5,1455)
    ax.set_ylim(0.8,2.05)
    plt.savefig('example.pdf')


    analyze.py


    import utils as ut
    data = ut.load_data('extreme_values.txt')
    L = [int(n) for n in data.strip().split('\n')]

    current = L.pop(0)
    print current, '-',
    while L:
        next = L.pop(0)
        if next != current + 1:
            print current
            print next, '-',
        current = next
    print next