## Sunday, March 27, 2011

### Equation of the ellipse

Here is a simple derivation of the equation for an ellipse. It is taken from Morris Kline's book, Calculus: An Intuitive and Physical Approach.

We place the two foci of an ellipse (F and F') at the coordinates `(c,0)` and `(-c,0)`. Each point on the ellipse is defined by the property that the sum of the distances to F and to F' is constant, which we define as equal to `2a`. Our old friend Pythagoras helps us find the distances in terms of `x,y` and `c`.

```
PF  = √[y² + (x-c)²]
PF' = √[y² + (x+c)²]

2a = PF + PF'
PF = 2a - PF'

√[y² + (x-c)²] = 2a - √[y² + (x+c)²]
```

Square both sides (and expand):

```
y² + x² - 2xc + c² = 4a² - 4a √[y² + (x+c)²] + y² + x² + 2xc + c²
```

Cancel terms (`x2`, `y2`, and `c2`) and rearrange to isolate the remaining square root:

```
-2xc = 4a² - 4a √[y² + (x+c)²] + 2xc

4a √[y² + (x+c)²] = 4a² + 4xc
 a √[y² + (x+c)²] = a² + xc
```

Square both sides again and expand:

```
a² (y² + x² + 2xc + c²) = a⁴ + 2a²xc + x²c²

a²y² + a²x² + a²c² = a⁴ + x²c²

a²x² - x²c² + a²y² = a⁴ - a²c²
```

Factor out `a2 - c2`:

```
x² (a² - c²) + a²y² = a² (a² - c²)
```

Define `b² = a² - c²`:

```
b²x² + a²y² = a²b²
```

Divide to obtain the familiar form:

```
x²/a² + y²/b² = 1
```

Note that when

```
x = 0, y = ±b
y = 0, x = ±a
```

The squaring method allows the possibility that the simplified equation has solutions that are not valid for the original version. This turns out not to be the case, and Kline deals with this issue in the book.

Also, we might note that since

`b² = a² - c²`

if `a` and `b` are fixed, then `c` is determined. For the figure shown here, `b = c` and hence `a = √2 b`.
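As a quick numeric check of the derivation (my own sketch, not from Kline), we can take the case `b = c` from the figure, so `a = √2 b`, and confirm that for points on the ellipse the sum of the focal distances really comes out to `2a`:

```python
from math import sqrt

# Check: for points on x²/a² + y²/b² = 1, the sum of distances
# to the foci at (c, 0) and (-c, 0) should equal 2a.
def focal_sum(x, a, b, c):
    y = b * sqrt(1 - x**2 / a**2)      # point on the upper half
    PF  = sqrt(y**2 + (x - c)**2)      # distance to F  at ( c, 0)
    PF_ = sqrt(y**2 + (x + c)**2)      # distance to F' at (-c, 0)
    return PF + PF_

b = c = 1.0
a = sqrt(2) * b                        # the b = c case from the figure
for x in [0.0, 0.5, 1.0, a]:
    assert abs(focal_sum(x, a, b, c) - 2*a) < 1e-9
```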

UPDATE: A virtually identical proof can be found in Wikipedia (here)---should have looked first.

### Flags, detail

A quick note about the flags. There are enough now that it's hard to be sure I got them all. So... I went to the Flag Counter site (this page, and the next). Rather than do anything fancy, I just copied the text to a file and then processed it with Python. My database of flag images is from Wikipedia, and I shortened the file names to the two-letter country code. Since I can't remember a number of them, I wrote a Python script to harvest the Flag Counter entries and match them with the country codes (from here).

I checked the directory with the flag images by eye, which is almost certainly a mistake.

Here is the list of countries from which visitors to this site have come, in alphabetical order. The script is at the end. It shows a number of typical issues you run into with this kind of processing.

```
AE United Arab Emirates
AL Albania
AN Netherlands Antilles
AR Argentina
AT Austria
AU Australia
BB Barbados
BE Belgium
BG Bulgaria
BH Bahrain
BR Brazil
BY Belarus
CA Canada
CH Switzerland
CL Chile
CN China
CO Colombia
CR Costa Rica
CS Serbia
CV Cape Verde
CY Cyprus
CZ Czech Republic
DE Germany
DK Denmark
DZ Algeria
EC Ecuador
EE Estonia
EG Egypt
ES Spain
FI Finland
FR France
GH Ghana
GR Greece
HK Hong Kong
HR Croatia
HU Hungary
ID Indonesia
IE Ireland
IL Israel
IN India
IS Iceland
IT Italy
JM Jamaica
JP Japan
KR South Korea
LT Lithuania
LU Luxembourg
MA Morocco
MD Moldova
MT Malta
MU Mauritius
MX Mexico
MY Malaysia
NL Netherlands
NO Norway
NZ New Zealand
PA Panama
PE Peru
PH Philippines
PK Pakistan
PL Poland
PR Puerto Rico
PT Portugal
QA Qatar
RO Romania
RU Russia
SA Saudi Arabia
SE Sweden
SG Singapore
SI Slovenia
SK Slovakia
SV El Salvador
TH Thailand
TN Tunisia
TR Turkey
TT Trinidad and Tobago
TW Taiwan
UA Ukraine
UK United Kingdom
US United States
UY Uruguay
VE Venezuela
VN Vietnam
ZA South Africa
```

```
from utils import load_data

specials = { 'South_Korea':'Korea_(South)',
             'Russia':'Russian_Federation',
             'New_Zealand':'New_Zealand_(Aotearoa)',
             'Serbia':'Serbia_and_Montenegro',
             'Croatia':'Croatia_(Hrvatska)',
             'Vietnam':'Viet_Nam' }

data = load_data('country-codes.txt')
D = dict()
for line in data.strip().split('\n'):
    L = line.strip().split()
    D['_'.join(L[1:])] = L[0]

cL = list()
data = load_data('scraped.txt')
for line in data.strip().split('\n'):
    L = line.strip().split()
    i = len(L) - 4
    country = '_'.join(L[1:i])
    cL.append(country)
    if country in specials:
        k = specials[country]
        D[country] = D[k]

def f(k):
    return D[k]

for country in sorted(cL, key=f):
    print D[country], '\t', country.replace('_',' ')
```

### Flag update

We continue to accumulate unique visitors from new countries. Here are the flags of 16 more.

## Friday, March 18, 2011

### nothing in biology makes sense except in the light of evolution

I've been reading Mike Yarus's book, Life from an RNA World (Amazon). It's a very readable account of evolution from the perspective of sequences and the RNA world. Mike is a highly intelligent guy, and his intelligence and wit inform every page. In one chapter, he gets Rob Knight and Steve Freeland to do an evolution simulation.

we let a computer write out a random string, mutate 1 in 100 characters in each generation, and select changes only if they match Dobzhansky

In my version:
• in each generation we pick one position
• if it already matches Dobzhansky, continue to the next generation
• mutate to a random choice from the set of symbols

This isn't a very good model of evolution. It was just fun to spend a few minutes coding it.

Here is the beginning, middle and end of one run:

```
> python evolve.py
    0 LVSTUknwIfGIRHFgHESZ M zthWtNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
  100 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidKWEZO ZwNjDyCvq
  200 LVSTUknwIfGIRHFgHESZ M zthetNQTk qgtoeMvjeJAzOidhWEZO ZwNjDyCvq
  300 LVSTiknwIfGIRHFgHESZ M zthetNQ k qgtoeMvjeJAzOidhWEZO ZwNjuyivq
 6100 notTing if biHlogy makes sense except iv the Oight of Zvolution
 6200 notTing if biHlogy makes sense except iv the Oight of Zvolution
 6300 nothing if biHlogy makes sense except iv the Oight of Zvolution
 6400 nothing if biHlogy makes sense except iv the Oight of Zvolution
13800 nothing in biHlogy makes sense except in the light of evolution
13900 nothing in biHlogy makes sense except in the light of evolution
14000 nothing in biHlogy makes sense except in the light of evolution
14100 nothing in biHlogy makes sense except in the light of evolution
14167 nothing in biology makes sense except in the light of evolution
14167 nothing in biology makes sense except in the light of evolution
```
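The final count is in the ballpark of a back-of-the-envelope estimate (my own, not from the book). The script picks one of the 63 positions per generation and, at a wrong position, accepts the mutation only if it hits the right character out of the 54 in the `symbols` string; a coupon-collector argument then predicts roughly `N * 54 * H_N` generations, where `H_N` is the Nth harmonic number:

```python
# Coupon-collector estimate of the expected run length:
# each wrong position is corrected with probability (1/63)*(1/54)
# per generation, and we wait for the slowest of the 63 positions.
N, S = 63, 54
H = sum(1.0 / k for k in range(1, N + 1))   # harmonic number H_63
expected = N * S * H
print(int(expected))   # roughly 16000, the same ballpark as the observed 14167
```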

```
import random

s = 'nothing in biology makes sense'
s += ' except in the light of evolution'
N = len(s)
symbols = 'abcdefghijklmnopqrstuvwxyz '
symbols += symbols.upper()

L = [random.choice(symbols) for i in range(N)]
print ' 0 ' + ''.join(L)

i = 0
while L != list(s):
    i += 1
    v = i and not i % 100
    if v:
        print str(i).rjust(5),
    j = random.choice(range(N))
    if L[j] == s[j]:
        if v:
            print ''.join(L)
        continue
    c = random.choice(symbols)
    if c == s[j]:
        L[j] = c
    if v:
        print ''.join(L)

print
print str(i).rjust(5), s
print str(i).rjust(5), ''.join(L)
```
### Mutual Information (3)

We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.

Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.

```
 90 4.102
260 4.025
122 3.993
209 3.976
234 3.944
101 3.939
235 3.938
152 3.935
 63 3.912
 22 3.887

159 0.191
250 0.138
132 0.133
291 0.086
199 0.067
242 0.052
 16 0.012
106 0.012
156 0.012
158 0.012
```

As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.

We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.

In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.

I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.

Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:

```
>26250004-26250005/1-457
KQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSAESINKDIEECNAIIEQFIDYLRTGMA
DLNAVLGEVIA--AESGYEREIETAL-YVKMHPLSIKRAVANMVVNAARY-GNGWIKVSS
GTEAWFQVEDDGPGIAPEQRKHLFQPFVRGDISGTGLGLAIVQRIVDNHNGMLELGTSER
GGLSIRAWLPNYKILVVDDDMRLRALLERYLTEQGFQVRSVANAEQMDRLLTRESFHLMV
LDLMLPGEDGLSICRRLRSQSPMPIIMVTAKGEEVDRIVGLEIGADDYIPKPFNPRELLA
RIRAVLRR
```

I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,

EnvZ (the HK) sequence starts with:

```
AGVKQLADDRTLLMAGVSHDL
   KQLADDRTLLMAGVSHDL
```

The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.

OmpR (the RR) sequence starts with

```
MQENYKILVVDDDMRLRALLER
   NYKILVVDDDMRLRALLER
```

The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.
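The two offsets can be wrapped in a tiny helper (my own sketch; `to_protein_residue` is a made-up name, and it assumes the HK occupies alignment columns 0-189 and the RR columns 190-307):

```python
# Map a 0-based alignment column to a 1-based protein residue number,
# using the offsets worked out above.
def to_protein_residue(i):
    if i >= 190:
        return ('RR', i - 187)   # OmpR residue number
    return ('HK', i + 15)        # EnvZ residue number

print(to_protein_residue(0))     # ('HK', 15): the K is EnvZ residue 15
print(to_protein_residue(190))   # ('RR', 3):  the N is OmpR residue 3
```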

Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).

[Note: to keep the code simple, I ignored the situation where a column contains '-' (a gap in the alignment) for some of the sequences. That's what gives the two strange results below, at rows 16 and 20 of the list.]

```
> python plot.py
 14 RLRAL  42 ATEMM 0.822
 18 LLERY  42 ATEMM 0.816
 22 YLTEQ  42 ATEMM 0.769
 15 LRALL  37 TRIRL 0.730
 15 LRALL  42 ATEMM 0.694
 56 MLPGE  54 DIEEC 0.688
 14 RLRAL  54 DIEEC 0.678
 21 RYLTE  42 ATEMM 0.677
 14 RLRAL  38 RIRLA 0.668
 83 KGEEV  22 TLLMA 0.667
 22 YLTEQ  21 RTLLM 0.663
  4 YKILV  42 ATEMM 0.658
 83 KGEEV  54 DIEEC 0.657
 22 YLTEQ  18 ADDRT 0.650
 22 YLTEQ  54 DIEEC 0.646
 18 LLERY  86 --AES 0.644
 15 LRALL  54 DIEEC 0.643
 18 LLERY  38 RIRLA 0.636
 22 YLTEQ  45 MMSAE 0.633
 22 YLTEQ  86 --AES 0.631
```

The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.

[ UPDATE: The heatmap looks pretty boring, so I'm going to skip it. But I plotted the top 20 interactions (graphic at the top of the post); the repetition indicates we're on the right track. ]

```
import sys
from utils import load_data
import matplotlib.pyplot as plt

data = load_data('results.2.txt')
data = data.strip().split('\n')
data = [e.split() for e in data]
data = [(int(t[0]), int(t[1]), float(t[2])) for t in data]

def f(t):
    return float(t[2])

def part1():
    L = [t for t in data if t[0] == t[1]]
    L = sorted(L, key=f, reverse=True)
    for t in L[:10]:
        print str(t[0]+1).rjust(3), round(t[2],3)
    print
    for t in L[-10:]:
        print str(t[0]+1).rjust(3), round(t[2],3)
    sys.exit()

#part1()

L = [t[2] for t in data if t[0] != t[1]]
X = 1.0
plt.hist(L, bins=X*50)
ax = plt.axes()
ax.set_xlim(0,X)
plt.savefig('example.png')

# t[0] always > t[1]
N = 190
data = [t for t in data if t[0] >= N and t[1] < N]

aln = load_data('cell3925mmc4.fa')
aln = aln.strip().split('>')[1:]
aln = [e for e in aln if e.startswith('26250004-26250005/1-457')]
envZ_ompR = aln[0].split('\n')[1]

for t in sorted(data, key=f, reverse=True)[:20]:
    i = t[0]
    rr = i - 187
    j = t[1]
    hk = j + 15
    print str(rr).rjust(3), envZ_ompR[i:i+5],
    print str(hk).rjust(3), envZ_ompR[j:j+5],
    print '%3.3f' % t[2]
```

### Mutual information (2)

We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). The first post is here.

In this part we'll load the alignment (supplementary data file S4---the annotation on the page is incorrect), and crunch the numbers. I just write the results to disk.

We'll do the analysis in another post.

`python info.py > results.txt`

```
import sys
from utils import load_data
import info_helper as ih

#aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'

fn = 'cell3925mmc4.fa'
data = load_data(fn)
data = data.strip().split('>')[1:]
data = [e.split('\n')[1].strip() for e in data]

def show(data):
    print 'starting:', len(data)
    for i in range(7):
        print i,
        L = [e for e in data if e.count('-') <= i]
        print len(L)
    sys.exit()

#show(data)

def transpose(L):
    R = range(len(L[0]))
    rL = list()
    for i in R:
        rL.append(''.join([item[i] for item in L]))
    return rL

data = [e for e in data if e.count('-') <= 4]
#data = data[:100]
cols = transpose(data)
pD = ih.make_prob_dict(cols)

info = dict()
for i in range(len(cols)):
    for j in range(i+1):
        info[(i,j)] = ih.get_info(i,j,cols,pD,v=False)

for i,j in sorted(info.keys()):
    print i, j, round(info[(i,j)],3)
```

## Thursday, March 17, 2011

### Fun with geometry (1)

I found a couple of fun books of problems in geometry, algebra, and probability (geometry book here).

This is one of the problems: given the red circles with radius one-half the large black circle, and the blue circle inscribed so as to just fit inside, derive a relation between the radius of the blue circle and the others. This had me scratching my head for a few minutes before the aha moment.

The challenge question is perhaps easier: prove that the filled-in gray area is equal to the area of one of the red circles.

And a hint for the first problem comes from the next graphic, where I've made a copy of the blue circle and positioned it strategically:

## Tuesday, March 15, 2011

### Mutual information

I want to talk about a really nice paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). It'll give us an opportunity to exercise our matplotlib skills.

We're going to try to recreate Fig 1, which is visible in the PubMed page, or you can get the original paper from the link to Cell.

Two-component signal transduction systems are ubiquitous in bacteria (wikipedia). The canonical design consists of a membrane-bound sensor (histidine) kinase (HK) and a cytoplasmic response regulator (RR). E. coli contains about 30 such pairs. The members of each pair have substantial specificity. The HK of the ntr system has specificity for its own RR, and likewise in the phoBR system, phoB has specificity for phoR. We may speak of a HK and its cognate RR.

For our purposes the important thing is that each system comprises (in the simplest design) two protein partners with complementary surfaces. These systems (a pair of proteins) are the products of ancient gene duplication events, and have since diverged over time. Amino acids at interacting sites are constrained to co-evolve in each pair.

If this sounds too vague or too complicated, consider an even simpler example: a stem of paired RNA residues in rRNA named H15.

Here is the H15 sequence in 1D (the parentheses indicate residues involved in pairing---see the link above for details):

```
(((( ((((( ))))) ))))
TGCACAATGGGCGCAAGCCTGATGCA
```

And here is the inner stem drawn in 2D to show the base-pairing more directly:

```
TGGGC
GTCCG
```

The base-pairing of this stem is more important to rRNA function than the identities of the bases. The result is that in some bacteria the identities of the central bases have been switched:

```
original    co-evolved

TGGGC       TGCGC
GTCCG       GTGCG
```

Presumably this happened in 2 discrete steps, but I don't know of any examples where the intermediate state has been preserved. Maybe we should look for some, and it's undoubtedly been studied.

To quantify this kind of coevolution, we'll draw on a concept (and mathematical definition) called mutual information. The steps in the calculation will be:

• make a multiple sequence alignment
• compare column X and column Y
• total number of sequences (length of each column) = c

• for each residue x in column X calculate px
• for each residue y in column Y calculate py
• px is the probability of residue x in column X

We'll write the columns horizontally to save space.
Suppose columns X and Y are:

```
X: AASSASSTTT
Y: NMWWNTTKKS
```

For column X we have:
```
pA = 0.3 (3 A out of a total of 10 residues)
pS = 0.4
pT = 0.3
```
For column Y:
```
pK = 0.2
pM = 0.1
pN = 0.2
pS = 0.1
pT = 0.2
pW = 0.2
```
We pre-calculate these values for each column. When we calculate the information, we'll refer to the probabilities for column Y as q rather than p, to keep them straight from the p's for column X.

Now, we consider each pair of residues, one from column X and one from column Y. This pair is made up of residues in two interacting protein surfaces or rRNA chains, that may have co-evolved.

• pxy is the number of sequences with x in column X and y in column Y
• divided by c, the total number of pairs:

```
X: AASSASSTTT
Y: NMWWNTTKKS
```

```
pAM = 0.1
pAN = 0.2
pST = 0.2
pSW = 0.2
pTK = 0.2
pTS = 0.1
```
Finally, to compute the mutual information for this pair of columns, we do this calculation for each individual pair of residues and then sum:

`pAM * log(pAM / (pA * qM)) = 0.1 * log(0.1 / (0.3 * 0.1)) = 0.0523`
I would have used log2, but Skerker et al used log10, so I matched them.
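The whole calculation for this pair of columns can be checked with a compact, self-contained snippet (a sketch in modern Python; the full helper module used for the real data is at the end of the post):

```python
from math import log10

X = 'AASSASSTTT'
Y = 'NMWWNTTKKS'
c = len(X)

# per-column residue probabilities (the p's and q's above)
p = {ch: X.count(ch) / c for ch in set(X)}
q = {ch: Y.count(ch) / c for ch in set(Y)}

# joint probabilities for each residue pair, then sum the terms
pairs = [x + y for x, y in zip(X, Y)]
info = sum(pairs.count(xy) / c *
           log10((pairs.count(xy) / c) / (p[xy[0]] * q[xy[1]]))
           for xy in set(pairs))

print(round(info, 2))   # 0.47
```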

Here is part of the output of the script below:

```
NMWWNTTKKS
AASSASSTTT
pKT 0.2 pK 0.2 qT 0.3 temp 3.33 final 0.1
pMA 0.1 pM 0.1 qA 0.3 temp 3.33 final 0.05
pNA 0.2 pN 0.2 qA 0.3 temp 3.33 final 0.1
pST 0.1 pS 0.1 qT 0.3 temp 3.33 final 0.05
pTS 0.2 pT 0.2 qS 0.4 temp 2.50 final 0.08
pWS 0.2 pW 0.2 qS 0.4 temp 2.50 final 0.08
info 0.47
```

temp is the result of the calculation inside the parentheses, above. Next time we'll apply this method to the data from Skerker.

```
from math import log

def log2(n):
    return log(n)*1.0/log(2)

def log10(n):
    return log(n)*1.0/log(10)

# cache character probabilities for each column
def make_prob_dict(cols):
    # input is a list of columns
    # c is the number of sequences in the alignment
    c = len(cols[0])
    pD = list()
    for col in cols:
        char_kinds = list(set(col))
        values = [col.count(k)*1.0/c for k in char_kinds]
        pD.append(dict(zip(char_kinds,values)))
    return pD

def get_info(i,j,cols,pD,v=False):
    col1, col2 = cols[i], cols[j]
    if v:
        print col1 + '\n' + col2
    # as before, c is the number of sequences
    c = len(col1)
    assert c == len(col2)
    info = 0
    pairs = [col1[k] + col2[k] for k in range(c)]
    pL = sorted(list(set(pairs)))
    for p in pL:
        pXY = pairs.count(p)*1.0/c
        pX = pD[i][p[0]]
        pY = pD[j][p[1]]
        inside = (pXY * 1.0) / (pX * pY)
        if v:  print 'p' + p, pXY,
        if v:  print 'p' + p[0], pX,
        if v:  print 'q' + p[1], pY,
        if v:  print 'temp', '%3.2f' % round(inside, 2),
        outside = pXY * log10(inside)
        if v:  print 'final', round(outside,2)
        info += outside
    if v:
        print 'info', round(info,2)
    return info

if __name__ == '__main__':
    aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'
    cols = aln.split('\n')
    pD = make_prob_dict(cols)
    info = dict()
    for i in range(len(cols)):
        for j in range(i):
            info[(i,j)] = get_info(i,j,cols,pD,v=True)
```
## Sunday, March 13, 2011

### Cocoa: where to start?

Recommended resources for beginning Cocoa with Objective-C

• do the temperature converter in the Cocoa Application Tutorial
• short articles at Cocoa Dev (here; much more here; C review)
• Aaron Hillegass's book
• Cocoa Fundamentals Guide (here)

Specific to PyObjC:

• Will Larson's tutorials (here here here here here)
• Apple's page including a version of the temperature converter
• Read the official introduction carefully (my biggest problem)

A page of links to old material, with simple demos of specific Cocoa features that mostly still work, is here.

Code a simple game like TicTacToe, Fifteen, or Color Sudoku.

After that, I've got tons of projects here and here. Get started with bindings (here), then move on to Vlad the Impaler (here).

Learn specific topics by reading the Apple docs (slowly and repeatedly; it can take a while to get it) and by building a simple demo project that does only that one thing. Like NSPredicate, which we've done in six posts (here here here here here & one to come).

And if you want the PyObjC templates for Xcode see here.
### NSPredicate: PyObjC version

Here is the answer in the previous post, converted to PyObjC. It's a bit simpler, except that some of the methods require a "real" NSArray or NSString.

```
from Foundation import *
import objc

class NSString(objc.Category(NSString)):
    def validate(self):
        A = self.UTF8String().split(':')
        fm = NSFileManager.defaultManager()
        def f(s):
            print 'test', s
            home = NSHomeDirectory()
            s = s.replace('~',home)
            return fm.fileExistsAtPath_(s)
        return all([f(s) for s in A])

s = NSString.stringWithString_('~/Desktop')
print s.validate()

for s in ['~:~/Desktop','xyz']:
    s = NSString.stringWithString_(s)
    f = NSExpression.expressionForConstantValue_(s)
    e = NSExpression.expressionForFunction_selectorName_arguments_(
        f,'validate',None)
    results = e.expressionValueWithObject_context_(None,None)
    print results
print

p = NSPredicate.predicateWithFormat_(
    "FUNCTION(SELF, 'validate') isEqual:YES")
A = ['~/Desktop','xyz']
A = [NSString.stringWithString_(s) for s in A]
A = NSArray.arrayWithArray_(A)
for item in A.filteredArrayUsingPredicate_(p):
    print item
print p.evaluateWithObject_(NSString.stringWithString_('~/Desktop'))
```

output:

```
test ~/Desktop
True
test ~
test ~/Desktop
True
test xyz
False
test ~/Desktop
test xyz
~/Desktop
test ~/Desktop
True
```

### NSPredicate: problem solved (almost)

It only took me about four hours, but with big help from this post, I solved the problem of filtering a string containing an array of (possibly invalid) filepaths. And, in the meantime, Dave DeLong himself had responded to my question on StackOverflow from yesterday, which, if I'd just waited a bit more, would have saved me some time. Thanks, Dave.

As I mentioned (here), we're just wrapping a filtering routine up in a category on NSString. Here is the first part of the code file including the category:

```
#import <Foundation/Foundation.h>

@interface NSString (ValidatingFilepathArray)
- (NSNumber *) validate;
@end

@implementation NSString (ValidatingFilepathArray)
- (NSNumber *) validate {
    NSArray *A = [self componentsSeparatedByString:@":"];
    NSString *s, *p;
    NSFileManager *fm = [NSFileManager defaultManager];
    for (p in A) {
        s = [p stringByExpandingTildeInPath];
        if ([fm fileExistsAtPath:s]) {
            NSLog(@"passed: %@", p);
        }
        else {
            NSLog(@"failed: %@", p);
            return [NSNumber numberWithBool:NO];
        }
    }
    return [NSNumber numberWithBool:YES];
}
@end
```

In the code at the bottom of the post, we construct a "function expression" from the new NSString method. Two things really confused me. First, in the method

 `expressionForFunction:selectorName:arguments:`

the "function" is the object itself. And the second thing was figuring out the format string for the predicate that wraps up our expression:

 `@"FUNCTION(SELF, 'validate') isEqual:YES"`

This is the output:

```
> gcc -o test test.m -framework Foundation
> ./test
2011-03-13 16:53:42.974 test[3494:903] passed: ~
2011-03-13 16:53:42.976 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.977 test[3494:903] expression for ~:~/Desktop: YES
2011-03-13 16:53:42.978 test[3494:903] failed: xyz
2011-03-13 16:53:42.978 test[3494:903] expression for xyz: NO
-------------------------------------------------
2011-03-13 16:53:42.979 test[3494:903] passed: ~
2011-03-13 16:53:42.980 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.980 test[3494:903] failed: xyz
2011-03-13 16:53:42.981 test[3494:903] filtered: ~:~/Desktop
2011-03-13 16:53:42.981 test[3494:903] passed: ~/Desktop
2011-03-13 16:53:42.982 test[3494:903] evaluate: YES
```

First we construct the expression and evaluate it. In the second part, we construct the predicate and use it to filter an array or just "evaluate" a string. Still to do: test it in an app with bindings and an NSTextField or NSTableView.

Code:

```
int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSString *validPathArray = @"~:~/Desktop";
    NSExpression *f = [NSExpression
        expressionForConstantValue:validPathArray];
    NSExpression *e = [NSExpression
        expressionForFunction:f
        selectorName:@"validate"
        arguments:nil];
    NSNumber *result = [e expressionValueWithObject:nil context:nil];
    NSArray *responses = [NSArray arrayWithObjects:@"NO",@"YES",nil];
    NSLog(@"expression for %@: %@", validPathArray,
        [responses objectAtIndex:[result intValue]]);

    NSString *invalidPathArray = @"xyz";
    f = [NSExpression expressionForConstantValue:invalidPathArray];
    e = [NSExpression expressionForFunction:f
        selectorName:@"validate"
        arguments:nil];
    result = [e expressionValueWithObject:nil context:nil];
    NSLog(@"expression for %@: %@", invalidPathArray,
        [responses objectAtIndex:[result intValue]]);

    printf("-------------------------------------------------\n");

    NSPredicate *p;
    p = [NSPredicate predicateWithFormat:
        @"FUNCTION(SELF, 'validate') isEqual:YES"];
    NSArray *A = [NSArray arrayWithObjects:
        validPathArray, invalidPathArray, nil];
    NSArray *fA = [A filteredArrayUsingPredicate:p];
    for (id obj in fA) {
        NSLog(@"filtered: %@", obj);
    }
    BOOL yesorno = [p evaluateWithObject:@"~/Desktop"];
    NSLog(@"evaluate: %@", [responses objectAtIndex:(int) yesorno]);
    [pool drain];
    return 0;
}
```

### NSExpression: simple examples

The example from last time (here) introduced the NSExpression class. That one was rather complex in code, though what it does is just a simple filtering of values.

Here is another example, which looks less forbidding: it combines two expressions in a single predicate. The expressions are used to construct a comparison predicate that grabs the object for key='value' from an array and checks it against the number 10.

As before the code block is inside a standard main:

```
#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    // code here..
    [pool drain];
    return 0;
}
```

and compiled like this:

 `> gcc -o test test.m -framework Foundation`

code:

```
NSExpression *lhs = [NSExpression expressionForKeyPath:@"value"];
NSNumber *ten = [NSNumber numberWithInt:10];
NSExpression *rhs = [NSExpression expressionForConstantValue:ten];
NSPredicate *p = [NSComparisonPredicate
    predicateWithLeftExpression:lhs
    rightExpression:rhs
    modifier:NSDirectPredicateModifier
    type:NSGreaterThanOrEqualToPredicateOperatorType
    options:0];
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:3] forKey:@"value"],
    [NSMutableDictionary dictionaryWithObject:
        [NSNumber numberWithInt:15] forKey:@"value"],
    nil];
NSArray *fA = [A filteredArrayUsingPredicate:p];
for (id obj in fA) {
    NSLog(@"%@", [obj description]);
}
```

```
> ./test
2011-03-13 10:58:53.885 test[218:903] {
    value = 15;
}
```

Here is a second example, taken from the docs, of how to construct an expression that uses a built-in function (they call this a function expression):

```
NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:6], nil];
NSExpression *eA = [NSExpression expressionForConstantValue:A];
NSArray *args = [NSArray arrayWithObject:eA];
NSExpression *e = [NSExpression
    expressionForFunction:@"average:"
    arguments:args];
id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);
float f = [result floatValue];
printf("result = %3.2f\n", f);
```

```
2011-03-13 11:01:00.337 test[236:903] 4.5 NSCFNumber
result = 4.50
```

There are lots of built-in functions available (here).

The third example is based on the first part (the simple part) of this post from Dave DeLong. It defines a category on NSNumber

```
@interface NSNumber (FactorialExpression)
- (NSNumber *) factorial;
@end

@implementation NSNumber (FactorialExpression)
- (NSNumber *) factorial {
    double baseValue = [self doubleValue];
    double result = tgamma(baseValue+1);
    return [NSNumber numberWithDouble:result];
}
@end
```

and the code block is:

```
NSNumber *n = [NSNumber numberWithDouble:4.2];
NSLog(@"%@ %@", n, [n factorial]);
NSLog(@"%p %d", n, [n respondsToSelector:@selector(factorial)]);
NSExpression *f = [NSExpression expressionForConstantValue:n];
NSExpression *e = [NSExpression
    expressionForFunction:f
    selectorName:@"factorial"
    arguments:nil];
NSLog(@"operand %@ %@", [e operand], [[e operand] class]);
NSLog(@"operand %@", [e function]);
id result = [e expressionValueWithObject:nil context:nil];
NSLog(@"%@ %@", [result description], [result class]);
```

```
> ./test
2011-03-13 11:02:46.798 test[251:903] 4.2 32.57809605033135
2011-03-13 11:02:46.800 test[251:903] 0x100108d20 1
2011-03-13 11:02:46.801 test[251:903] operand 4.2 NSConstantValueExpression
2011-03-13 11:02:46.801 test[251:903] operand factorial
2011-03-13 11:02:46.802 test[251:903] 32.57809605033135 NSCFNumber
```

And now, the way to solve my initial question seems clear:

Define a category on NSString that does what I want.
Wrap the call up in an NSExpression and then an NSPredicate. Next time, if I succeed.

## Saturday, March 12, 2011

### NSPredicate: compound Cocoa example in code

I'm resisting the temptation to call this simple---it's not! All we're doing is filtering an array of dicts for those whose 'value' is neither too large nor too small. I have no idea why it has to be this complicated. The example is from the NSPredicate docs (as reformatted and with renamed vars by me):

```
#import <Foundation/Foundation.h>

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSArray *A = [NSArray arrayWithObjects:
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:5] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:50] forKey:@"value"],
        [NSDictionary dictionaryWithObject:
            [NSNumber numberWithInt:500] forKey:@"value"],
        nil];
    NSExpression *lhs = [NSExpression expressionForKeyPath:@"value"];
    NSExpression *gtrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:10]];
    NSExpression *ltrhs = [NSExpression
        expressionForConstantValue:[NSNumber numberWithInt:100]];
    NSPredicate *gtpred;
    gtpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:gtrhs
        modifier:NSDirectPredicateModifier
        type:NSGreaterThanOrEqualToPredicateOperatorType
        options:0];
    NSPredicate *ltpred;
    ltpred = [NSComparisonPredicate
        predicateWithLeftExpression:lhs
        rightExpression:ltrhs
        modifier:NSDirectPredicateModifier
        type:NSLessThanOrEqualToPredicateOperatorType
        options:0];
    NSPredicate *pred;
    pred = [NSCompoundPredicate andPredicateWithSubpredicates:
        [NSArray arrayWithObjects:gtpred, ltpred, nil]];
    NSArray *fA = [A filteredArrayUsingPredicate:pred];
    for (id obj in fA) {
        NSLog(@"%@", [obj description]);
    }
    return 0;
}
```

```
> gcc -o test pred.m -framework Foundation
> ./test
2011-03-12 11:48:31.640 test[70403:903] {
    value = 50;
}
```

### NSPredicate: simple Cocoa examples

I'm exploring predicates and expressions in Cocoa using Objective-C. Actually, I'd like to use PyObjC, but this is tricky enough that I think it's better to start with Objective-C.

Ultimately I'd like to write complex predicates, for example, to filter using either a custom function or a block. The reason I need this is to validate edits to an NSTableView using bindings. According to the docs:

A predicate is a logical operator that returns a Boolean value (true or false). There are two types of predicate: comparison predicates and compound predicates:

● A comparison predicate compares two expressions using an operator. The expressions are referred to as the left hand side and the right hand side of the predicate (with the operator in the middle). A comparison predicate returns the result of invoking the operator with the results of evaluating the expressions.

● A compound predicate compares the results of evaluating two or more other predicates, or negates another predicate.
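Before diving into Objective-C, here's a rough Python analogy of that definition (my own sketch, not anything from Cocoa): expressions are functions of an object, a comparison predicate applies an operator to the results of two expressions, and a compound predicate combines predicates.

```python
import operator

def key_path(key):                  # expression: look up a key
    return lambda obj: obj[key]

def constant(c):                    # expression: constant value
    return lambda obj: c

def comparison(lhs, op, rhs):       # comparison predicate
    return lambda obj: op(lhs(obj), rhs(obj))

def and_predicate(*preds):          # compound predicate (AND)
    return lambda obj: all(p(obj) for p in preds)

# analog of the 10 <= value <= 100 example later in the post
pred = and_predicate(
    comparison(key_path('value'), operator.ge, constant(10)),
    comparison(key_path('value'), operator.le, constant(100)))

A = [{'value': 5}, {'value': 50}, {'value': 500}]
print([d for d in A if pred(d)])    # [{'value': 50}]
```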

All the code is like this:

 ```#import int main (int argc, const char * argv[]) { NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init]; .. specific code here .. [pool drain]; return 0; }```

compiled and run like this:

```
> gcc -o test pred.m -framework Foundation
> ./test
```

#### Example 1. Predicate with a format string using "like"

(the `[c]` makes the comparison case-insensitive, and the `*` is a wildcard):

```
NSArray *A = [NSArray arrayWithObjects:
    @"John",@"Paul",@"George",@"Ringo",nil];
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF like [c] 'g*' "];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:38:55.379 test[67230:903] George
```

#### Example 2. Format string with a mathematical operator

```
NSArray *A = [NSArray arrayWithObjects:
    [NSNumber numberWithInt:3],
    [NSNumber numberWithInt:5],nil];
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"SELF > 4 "];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:38:22.313 test[67217:903] 5
```

Note: this also works with the usual kind of format-string substitution: `@"SELF > %d ", 4`

#### Example 3: Format string using a key path

```
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"],
    nil];
NSPredicate *p = [NSPredicate predicateWithFormat:@"name like %@", @"x"];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 10:01:43.963 test[67885:903] {
    name = x;
}
```

#### Example 4: Format string substituting a dynamic key path

```
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObject:@"x" forKey:@"name"],
    [NSMutableDictionary dictionaryWithObject:@"y" forKey:@"name"],
    nil];
//NSLog(@"%@", [[A objectAtIndex:0] description]);
NSString *name = @"name";
NSString *value = @"x";
NSPredicate *p = [NSPredicate predicateWithFormat:@"%K like %@", name, value];
NSLog(@"%@", [[A filteredArrayUsingPredicate:p] objectAtIndex:0]);

> ./test
2011-03-12 09:29:59.220 test[67013:903] {
    name = x;
}
```

#### Example 5. Compound predicate

```
NSArray *theKeys = [NSArray arrayWithObjects:@"name",@"value",nil];
NSArray *o1 = [NSArray arrayWithObjects:@"x",[NSNumber numberWithInt:5],nil];
NSArray *o2 = [NSArray arrayWithObjects:@"y",[NSNumber numberWithInt:7],nil];
NSArray *o3 = [NSArray arrayWithObjects:@"z",[NSNumber numberWithInt:9],nil];
NSArray *A = [NSArray arrayWithObjects:
    [NSMutableDictionary dictionaryWithObjects:o1 forKeys:theKeys],
    [NSMutableDictionary dictionaryWithObjects:o2 forKeys:theKeys],
    nil];
NSLog(@"%d", [A count]);
NSPredicate *p = [NSPredicate predicateWithFormat:
    @"(value < %@) OR (name == %@)",
    [NSNumber numberWithInt:6], @"y"];
NSArray *fA = [A filteredArrayUsingPredicate:p];
NSLog(@"%d", [fA count]);
for (id obj in fA) { NSLog(@"%@", [obj description]); }

> ./test
2011-03-12 10:34:11.921 test[68778:903] 2
2011-03-12 10:34:11.924 test[68778:903] 2
2011-03-12 10:34:11.925 test[68778:903] {
    name = x;
    value = 5;
}
2011-03-12 10:34:11.925 test[68778:903] {
    name = y;
    value = 7;
}
```
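For comparison, the logic of this compound predicate can be mimicked in plain Python (just an illustration of what the predicate does, not PyObjC):

```python
# Two "records", like the NSMutableDictionary objects above.
A = [{'name': 'x', 'value': 5},
     {'name': 'y', 'value': 7}]

# The compound predicate: (value < 6) OR (name == 'y')
pred = lambda d: d['value'] < 6 or d['name'] == 'y'

# The analogue of filteredArrayUsingPredicate:
fA = [d for d in A if pred(d)]
```

Both records pass, just as both dictionaries do in the Objective-C version.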

### Dental project (6)

I want to show some more results from this project, namely, the UniFrac analysis. What I did for the paper was to cluster very closely related sequences (alignment > 450 and 0 or 1 mismatches), then upload them to RDP, which aligns the sequences as they are uploaded. The phylogenetic tree needs to be rooted, and I decided to use Thermotoga SL7 for this (Genbank AJ401017).

Rather than deal with the clustered OTUs for this post, I just uploaded all 1120 sequences, and carried out the analysis. The first time through (today) I forgot to include the outgroup! So that gives us a chance to see how much difference it makes.

• Working in directory: `temp`
• Check that seqs.fna from dental project dir has 1120 seqs
• Rename to `dental_1120.fna`
• Download as `dental_1120_rdp.fna`

• Use R/ape to make a tree

```
setwd('Desktop/temp/rdp')
library(ape)
dna = read.dna('dental_1120_rdp.fna',format='fasta')
tr = nj(dist.dna(dna))
plot(tr)
write.tree(tr,'tree.txt')
```

• Write a simple script to make the environment file

 `python write_env.py > environ.txt`

It looks like this:

```
DAA_44 D_DAA
DAA_43 D_DAA
DQ_209 D_DQ
..
```
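The post doesn't show `write_env.py` itself; a minimal sketch consistent with the output above (ids like `DAA_44` mapping to an environment label `D_DAA`) might be:

```python
# Hypothetical core of write_env.py: the environment (sample) name is
# the alpha part of the sequence id, prefixed with 'D_', in the
# two-column format UniFrac expects.
def make_env(titles):
    lines = []
    for t in titles:
        sample = t.split('_')[0]       # 'DAA_44' -> 'DAA'
        lines.append(t + ' ' + 'D_' + sample)
    return '\n'.join(lines)
```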

• Upload tree and environment file to UniFrac
• PCA (unweighted)
• View as data table

• Download data to `pca_web.txt`
• Run the script below to plot the data using matplotlib.


From the UniFrac FAQ:

My tree was not rooted, but I was able to upload my file and perform an analysis. Are the results valid?

There is no way to tell based on a Newick string alone whether a tree is rooted or not. If an unrooted tree is input, UniFrac will usually assign an arbitrary root and allow you to perform the analysis on that tree. How the tree is rooted can affect the results of both UniFrac tests and the P test. You should redo the analysis with a tree that is rooted with an appropriate outgroup.

It turns out to be easy enough: go back to RDP, browse to find SL7, and add it to the sequence cart. Repeat the download to `dental_1120+_rdp.fna`. Load the last 5 sequences into clustalx.app and check that SL7 is really properly aligned.

```
setwd('Desktop/temp/rdp')
library(ape)
dna = read.dna('dental_1120+_rdp.fna',format='fasta')
tr = nj(dist.dna(dna))
> tr

Phylogenetic tree with 1121 tips and 1119 internal nodes.

Tip labels:
  DAA_44, DAA_43, DQ_209, DAA_45, DAA_40, DC_81, ...

Unrooted; includes branch lengths.
> grep('SL', tr$tip.label)
[1] 1121
```

• Root the tree appropriately and write it to disk

```
tr2 = root(tr,1121)
plot(tr2)
write.tree(tr2,'rooted_tree.txt')
```

• Go back to UniFrac

Repeat the PCA. You can look at the data in a spreadsheet app:

Now I plot it in matplotlib. The first image is what I plotted today for the rooted tree. The second is from the paper. Looks pretty good to me. Also, note some minor differences from the previous graphic, where the tree we used was unrooted (UniFrac rooted it for us, however it manages that when a tree isn't properly rooted).

`plot_web.py`

```
import sys
import matplotlib.pyplot as plt
from fileUtilities import load_data

d = 0.5
fn = 'pca_web.txt'
data = load_data(fn)
data = data.strip().split('\n\n')[0]
data = data.strip().split('\n')[1:]
L = list()
for e in data:
    name, x, y = e.split()[:3]
    x,y = float(x), float(y)
    x *= -1
    name = name[2:]
    if name[1] in 'BCM':
        c = 'blue'
    else:
        c = 'red'
    plt.scatter(x,y,s=100,color=c)
    if name == 'DG':
        y += 0.03
    plt.text(x+0.03,y-0.02,va='center',
        s=name[1:],color=c,fontsize=16)

plt.plot((-d,d),(0,0),':',zorder=0)
plt.plot((0,0),(-d,d),':',zorder=0)
ax = plt.axes()
ax.set_xlim(-d,d)
ax.set_ylim(-d,d)
plt.savefig('pca_web.png')
```

### Dental project (5)

This post is one of a series (see dental project here or in the sidebar).

Last time I said I would show you how I make heatmaps these days. I've approached it several different ways over the past few years (R, Cocoa, matplotlib), but I think now that matplotlib is best, at least for me. Ultimately what I want is flexibility, and if you're a Python coder and you have matplotlib installed (as we've also discussed many times), then you'll have that. But I don't want to get into the technical details---and actually the script is a bit long, so I just put it (`Heatmapper.py` and its helper `Preprocessor.py`) into the zipped project files on Dropbox (here). The output from two different modes is at the bottom of the post. You just need a file `data.csv` in the same directory. It looks a little fuzzy and not as clean as I would like, but that's partly because there are so many samples, and partly because of the italic font. If you do `savefig` to a pdf file, and then blow it up, it looks great.
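`Heatmapper.py` and `Preprocessor.py` are in the Dropbox archive, so here is only a minimal sketch of the core idea: read `data.csv`, treat blank cells as zeros, and draw the matrix with matplotlib. The function names here are mine, not the actual script's.

```python
import csv
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

def load_matrix(fn):
    # data.csv has a blank leading header cell, then sample names;
    # each following row is a genus name plus counts (blank = zero)
    with open(fn) as f:
        rows = list(csv.reader(f))
    samples = rows[0][1:]
    genera = [r[0] for r in rows[1:]]
    M = np.array([[float(x) if x else 0.0 for x in r[1:]]
                  for r in rows[1:]])
    return genera, samples, M

def heatmap(genera, samples, M, fn='heatmap.png'):
    fig, ax = plt.subplots()
    ax.imshow(M, cmap='hot', aspect='auto', interpolation='nearest')
    ax.set_xticks(range(len(samples)))
    ax.set_xticklabels(samples, rotation=90)
    ax.set_yticks(range(len(genera)))
    ax.set_yticklabels(genera, fontsize=6)
    fig.savefig(fn)
```

Saving to pdf instead of png (just change the extension) gives the scalable output mentioned above.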

In this post I want to talk in a general way about the project and what I think it means. It began about four years ago, when we became aware that some folks in Dentistry at our school (WVU) were involved in a huge study of people from Appalachia (it's called COHRA). Poor oral health is a particular problem in West Virginia, and this study had collected thousands of samples along with patient histories and lots of clinical data. My belief is that the important thing about these samples is that the patients are young yet have serious periodontal issues. In any event, we convinced the people who actually run the project (based elsewhere) to let us have (a small part of) 8 samples out of all their thousands sitting in the freezers down the hall.

We did PCR with "universal" primers for the bacterial 16S rRNA gene, and cloned and sequenced the numbers you see in the table. It's not a big study (we don't have much money anyway), but we saw something which I think is truly significant. In high disease individuals, a broad group of microbes from the Clostridiales including an unusual clade called the Veillonellaceae are increased in abundance, whereas the sequences from control individuals in this clade were all very closely related to Veillonella parvula.

One reason this observation may be important is that the so-called "red complex", which is thought to be associated with serious periodontal disease, can only be recovered in about half the individuals with this diagnosis (not even considering abundance).

That story is in the modified version of the map above, where I drew a red box around the region of interest for the three controls, or "low disease" samples. Time went on, another set of samples from a different group was added to the study, and we were able to get the work published. So that's why the study looks so old-fashioned: in an era of millions of reads, we've got about a thousand.

My role in all this was to actually do the analysis. I remember "we" wrote a grant (actually, someone else did!) and listed me as a technical expert in bioinformatics. Of course, the reviews were scathing. Dr. E doesn't have a degree in bioinformatics. How could he know anything?

Well, I've learned a few things over the years. Rule one is, never make your own database: let someone else do it. That's why HOMD (and Greengenes and RDP) are so great. I particularly like the tools at the RDP site. It is very nice software.

And rule two is, if you live long enough, you will see work that took you months or years to accomplish be achieved using new tools in mere seconds or hours. Sequencing is a great example of this. When I was young I spent most of three months getting 500 bp; when I was a bit older I invested six months for 3.5 kb; still later it was a year for 20 kb.

This project is another example. I spent a year and more writing some 50 or so Python scripts (and rewriting them), and now QIIME does the whole thing in mere seconds.

Well, not quite the whole thing. I have a bit more to do with this project. I want to show you the UniFrac analysis of beta diversity, and show how to make what I think is a nicer plot of the PCoA results. Also, I want to show some phylogenetic trees detailing the increased diversity (species richness, really) in the Veillonellaceae that I mentioned.

And I should say: it's been fun. Even if I don't have that degree, or any papers with Rob Knight, I think I've learned something about Bioinformatics in the last 5 years.

## Thursday, March 10, 2011

### Dental project (4)

This post is one of a series (see dental project here or in the sidebar).

After getting a set of sequences and removing chimeras, the next step is almost anticlimactic. We just copy a modified version of the shell script (from here, without the `cd` calls) or paste in the commands working from the `dental` directory (either individually, or all at once):

```
#!/bin/bash
pick_otus.py -i seqs.fna -m uclust -s 0.97 -o otus
pick_rep_set.py -f seqs.fna -i otus/seqs_otus.txt -m most_abundant -o otus/reps.txt
align_seqs.py -i otus/reps.txt -m pynast -t ~/data/core.txt -o aln
assign_taxonomy.py -i otus/reps.txt -m rdp -o tax
filter_alignment.py -i aln/reps_aligned.txt -m ~/data/mask.txt -o aln2
make_phylogeny.py -i aln2/reps_aligned_pfiltered.fasta -o figs/tree.tre
make_otu_table.py -i otus/seqs_otus.txt -t tax/reps_tax_assignments.txt -o figs/otu_table.txt
summarize_taxa.py -i figs/otu_table.txt -o figs/otu_table_Level3.txt -L 3
plot_taxa_summary.py -i figs/otu_table_Level3.txt -l Phylum -o figs -k white
make_otu_heatmap_html.py -i figs/otu_table.txt -o figs
```

It's all over in a few seconds.

The heatmap QIIME produced is at the top of the post. It is truly a remarkable html page, with a graphic where you can reorder the columns or rows by drag and drop, redo the map at different thresholds for the OTUs, etc. I've never seen anything quite like it. But (and this is just me), it's not pretty enough.

So what I'd like to do from here is to show you how I currently make heatmaps with matplotlib, and we'll get into that next time.

First, I have to extract the data from QIIME. The script is complicated a bit by an additional job: I'm going to organize the rows and columns. (QIIME can do this too---see the tutorial).

The columns will be in the order they appear in `sample_names.txt` and the rows as they appear in `genera_and_colors.txt`. These files are in the same directory. The second one starts like this:

```
# Bacteria black
Bacteria
# Bacteroidetes green
Bacteroidetes
Bacteroidales
Rikenella
Prevotella
Porphyromonas
Saprospiraceae
Sphingobacteriales
```

I just do this:

 `> python grab_qiime_data.py > data.csv`

Here's the first part of `data.csv`:

```
,DB,DC,DM,DF,DL,DG,DI,DA,DT,DQ,DV,DAA,DZ
Bacteria,,,,2,,,,,,,,,12
Bacteroidetes,,,,,,2,,,,,,,
Bacteroidales,,,,,,1,,,,,,,
Prevotella,,,,,3,2,,1,,,,,
Porphyromonas,,,,,,1,,1,,,,,
Sphingobacteriales,,,,,,,,,,1,,,
Capnocytophaga,9,,6,2,3,6,1,19,31,35,2,8,7
Fusobacterium,1,,,4,,2,1,2,,2,1,3,1
```

The leading comma on line 1 is so the column headers line up properly in a spreadsheet. Speaking of spreadsheets, here is a screenshot after dropping the `data.csv` file onto Numbers (you could use Excel, of course):
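`grab_qiime_data.py` itself is in the Dropbox archive; the reordering step at its heart might look like this sketch (the function name and the counts data structure are my assumptions, not the actual script):

```python
# Given counts keyed by (genus, sample), emit csv rows in the order
# dictated by the two control files, leaving zero cells blank (as in
# data.csv above).
def write_csv(counts, genera, samples):
    lines = [',' + ','.join(samples)]   # leading comma for the header
    for g in genera:
        row = [str(counts.get((g, s), '')) for s in samples]
        lines.append(g + ',' + ','.join(row))
    return '\n'.join(lines)
```

The point is that row order comes from `genera_and_colors.txt` and column order from `sample_names.txt`, not from QIIME's output.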

That was painless! Zipped project files in Dropbox (here).

## Wednesday, March 9, 2011

### Dental project (3)

wikimedia

Before starting on analysis of the 1124 sequences from last time (here), we need to check for chimeras.

And at this point, I have a confession to make. It turns out there are 3 and perhaps 4 chimeras in the set of sequences from Genbank. I discovered this unwelcome fact a few weeks ago when playing with the QIIME toolkit. Since one of the pieces of software they recommend is ChimeraSlayer, I tried it out on these sequences.

Make a directory `temp` with a copy of `seqs.fna`. The sequences first need to be converted to NAST format, then we can run `ChimeraSlayer.pl`.

```
prog1=~/Software/microbiomeutil_2010-11-02/NAST-iEr/run_NAST-iEr.pl
$prog1 --query_FASTA seqs.fna > seqs.nast
prog2=~/Software/microbiomeutil_2010-11-02/ChimeraSlayer/ChimeraSlayer.pl
$prog2 --query_NAST seqs.nast
```

It takes the better part of an hour on my slowest machine (a 5-year-old iMac).

`seqs.nast.CPS.CPC.wTaxons` has flagged four sequences:

```
DA228   INTRA-GENUS
DQ822   INTRA-PHYLUM
DV55    INTRA-FAMILY
DAA89   INTRA-FAMILY
```

I grab those four by hand into a new file `suspects.fna` (there is probably a better way) and do:

```
$prog1 --query_FASTA suspects.fna > suspects.nast
$prog2 --query_NAST suspects.nast --printCSalignments
```

The output shows there is definitely a problem. In `suspects.nast.CPS.CPC.wTaxons` we have:

```
ChimeraSlayer  DQ_822  S000427388  S000260335  1.0566  98.74  100  0.7978  74.56  0  YES  NAST:1861-1863  ECO:324-325
Streptococcus  Streptococcus cristatus (T); NCTC12479; AB008313  Streptococcus cristatus
Lachnospiraceae Incertae Sedis  Clostridium aerotolerans (T); DSM 5434; X76163  Clostridium aerotolerans
INTRA-PHYLUM

Per_id parents: 73.80   Per_id(Q,A): 93.45
---------------------------------------------------  A: S000427388
        99.64                             78.63
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~  Q: DQ_822
DivR: 1.057 BS: 100.00   |   Per_id(QLA,QRB): 98.74
                         |
          (L-AB: 72.50)  |  (R-AB: 76.92)
            WinL:0-279   |  WinR:280-396
                         |   Per_id(QLB,QRA): 74.56
DivR: 0.798 BS: 0.00     |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~  Q: DQ_822
        72.86                             96.58
----------------------------------------------------  B: S000260335
Per_id(Q,B): 79.85
DeltaL: 26.79  DeltaR: -17.95

CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATGGTCAATGGGAAATCC  A: S000427388
CTAACGAGGAGGCGCTTGGTTAAGGGCTAGCTAAATTGCATGATAGTCAATGGGAAATCC  Q: DQ_822
TCCGACTAAGATTCGGAATCGGGCACAAGATCTCGACAGGCAGCACAGTGGAACTCCGGT  B: S000260335
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!  !!!!!!!!!!!!!!!

ACCCTTGTCGTGACATC  A: S000427388
ACCCTTGTCGTGACATC  Q: DQ_822
GGTGCCTGGACCCAGCT  B: S000260335
!!!!!!!!!!!!!!!!!

** Breakpoint **

!!!!!!!!!!!!! !!!!! !!!!!!!
GACCGACGGTCGAGTGTATCGGGGTAAA  A: S000427388
AGAGAGGACCTCGGTTATATGACAGCGG  Q: DQ_822
AGAGAGGACCTCGAGTATATCACAGCAG  B: S000260335
 !! ! !  !
```

The first match is great for a while, then terrible, and the second is the converse.

I need to look into whether I should update the Genbank records, but I suspect the answer is yes.

Anyway, I should have discovered this easily. I wrote a Python tool that looks for chimeras by BLAST of the front and back "halves" of each sequence against our local "boutique" database. It prints the top five hits for each. Here is the output for three of the suspects:

```
>DQ822
BLAST front (len = 207):
320  207/207  100.00  Streptococcus_clone_BP2-57_AB121930.1
319  205/207   99.03  Streptococcus_clone_502H08_AM420202.1
323  203/207   98.07  Streptococcus_cristatus_AB008313.1
321  200/207   96.62  Streptococcus_clone_BW009_AY005042.1
334  199/209   95.22  Streptococcus_sanguinis_SK36_SK36
BLAST back (len = 187):
349  179/187   95.72  Uncultured_clone_4.59_DQ346409.1
79   179/187   95.72  Clostridiales_clone_301C11_AM420062.1
121  178/187   95.19  Eubacterium_clone_DO008_AF385508.1
117  178/187   95.19  Eubacterium_clone_BP2-88_AB121960.1
115  178/187   95.19  Eubacterium_clone_BL026B96_AY806377.1
>DV55
BLAST front (len = 210):
353  207/210   98.57  Uncultured_clone_E105_DQ326659.1
100  206/210   98.10  Dialister_sp._E2_20_AF481209.1
94   205/210   97.62  Dialister_invisus_AY162469.1
31   190/207   91.79  Allisonella_clone_BL34_DQ130020.1
93   192/210   91.43  Dialister_clone_MCE7_134_AF481210.1
BLAST back (len = 190):
373  190/190  100.00  Veillonella_parvula_X84005.1
372  190/190  100.00  Veillonella_clone_X042_AF287781.1
370  190/190  100.00  Veillonella_clone_BU083_AF366266.1
369  190/190  100.00  Veillonella_clone_AA050_AF287782.1
371  182/183   99.45  Veillonella_clone_R1_DQ123569.1
>DAA89
BLAST front (len = 210):
283  209/210   99.52  Selenomonas_clone_CI002_AF287798.1
286  202/210   96.19  Selenomonas_clone_EQ054_AF385495.1
288  201/210   95.71  Selenomonas_clone_FT050_AY349403.1
298  189/195   96.92  Selenomonas_noxia_AF287799.1
297  200/210   95.24  Selenomonas_infelix_AF287802.1
BLAST back (len = 190):
373  184/187   98.40  Veillonella_parvula_X84005.1
370  184/187   98.40  Veillonella_clone_BU083_AF366266.1
372  181/184   98.37  Veillonella_clone_X042_AF287781.1
369  181/184   98.37  Veillonella_clone_AA050_AF287782.1
371  182/187   97.33  Veillonella_clone_R1_DQ123569.1
```

Note on sequence titles: I just introduced the underscore recently (as in DA_228), so this output doesn't have them.

It's pretty obvious that these guys are problematic. What happened is that I integrated the tool into the toolchain, but I never wrote code to look through the output and flag potential problems. I always did it manually, and as additional sequence samples were added to the experiment, I forgot to carry out this step.

Moral of the story: if you want to be sure something gets done, every time, you need to automate it completely! Otherwise you might forget.

We'll remove these four from our sequence file by hand, leaving 1120:

```
DA_228
DQ_822
DV_55
DAA_89
```
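Taking my own moral to heart, the removal could be automated too. Here's a sketch (not part of the actual toolchain; the function is hypothetical) that drops flagged ids from a fasta string:

```python
# Drop records whose title is in bad_ids; everything else passes
# through. Titles are taken as the first word after '>'.
def filter_fasta(fasta_text, bad_ids):
    keep = []
    for rec in fasta_text.strip().split('>')[1:]:
        title = rec.split('\n', 1)[0].split()[0]
        if title not in bad_ids:
            keep.append('>' + rec.strip())
    return '\n'.join(keep)
```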

### David Broder

I'm a political junkie. Can't help it. I start the day with Ezra Klein (after he appears online), and perhaps, end the day with Josh Marshall's gang. So it is with sadness that I hear of David Broder's passing (no link, the Post obit runs an ad).

There was a time in the 90s when I watched Washington Week in Review (when it was something, with Ken Bode as the moderator) and Broder was always impressive. I think he was the last of the giants---think David Brinkley, Walter Cronkite, Charles Kuralt.

Now we're left with screeching vapid idiots like---well, anyone on CNN except David Gergen. I miss the time when serious analysis could be found on TV. Sigh. Perhaps a good analogy is this: think of the difference between Hubie Brown (or Doug Collins) and anyone else who calls basketball games. I love Hubie, he's entertaining, but even more important, you can actually learn something.

### Dental project (2)

This is a series, first post here. Before we do anything else, we need to clean up the titles on the sequences. They come from Genbank like this:

 `>gi|324104022|gb|HQ894465.1| Uncultured bacterium clone DA19 16S ribosomal RNA gene, partial sequence`

We want this:

 `>DA_19`

We could do something like a regular expression, but that's overkill. Note that the alpha part is variable length, so we have to be a little bit smart. But these are so regular, it's easy. Also, UniFrac wants an underscore, so we add that.

Just do this from the command line:

 `> python retitle.py > seqs.txt`

If you count the sequences, you should have 1124.

`retitle.py`

```
import utils as ut

digits = '0123456789'
data = ut.load_data('results.txt')
data = data.strip().split('\n\n')
for item in data:
    title,seq = ut.clean_fasta(item)
    e = title.split()[4]
    d = ''.join([c for c in e if c in digits])
    s = ''.join([c for c in e if not c in digits])
    print '>' + s + '_' + d
    print seq
    print
```

### Dental project (1)

I'd like to spend a few posts talking about a project we just completed on surveying the bacterial species present in patients with gingivitis compared to normal controls. The main reason is to exercise QIIME with a different data set, but I'd also like to say a bit about the project.

I checked this morning at PubMed and found the paper has come out:

Olson 2011 PMID 21362199

Click on the link to download it. At 7.6 MB, that's just about the largest file I've ever seen for a paper. Opening it up, I see why: it's still in manuscript form, and some of the figures are quite big. To play with this, we'll need the sequences. Luckily they were just posted by Genbank the other day.

I wrote a script to grab the sequences in chunks of 40, with a timer to sleep for 10 seconds between requests. The first sign of trouble was here:

```
HQ895465 HQ895504 1
URL Error
HQ895505 HQ895544 1
..
```

but eventually, we did another request for this batch which looked like it worked:

 `HQ895465 HQ895504 2`

but actually, the file contains this near the end:

 `PubseqAccess cmd(id_gi_by_word HQ895505) failed with Cannot get server name from load balancer 'PUBSEQ_OS_PUBLIC' : errmsg='Service not found' HQ895505`

and then more of the same. Looking at the sequences, it seems they cut us off at 1000 sequences, stopping with HQ895464.1.

I thought this would be OK: it's very early in the morning, and we left more than 3 seconds between requests, but apparently we ran up against some other kind of limit.

I give the server some time to calm down (and change the name of the file we've written to), edit the list, and try again:

```
> python fetchSeqs.py
HQ895465 HQ895504 1
HQ895505 HQ895544 1
HQ895545 HQ895584 1
HQ895585 HQ895588 1
```

then combine the results by hand. Next time we'll take a look at them.

`fetchSeqs.py`

```
import urllib2, sys, time
from utils import load_data

ncbi = 'http://eutils.ncbi.nlm.nih.gov'
eutils = ncbi + '/entrez/eutils/'
efetch = 'efetch.fcgi?'

def chunks(L,SZ):
    rL = list()
    while L:
        rL.append(L[:SZ])
        L = L[SZ:]
    return rL

def fetch(L):
    s = eutils + efetch
    s += 'id=' + L
    s += '&db=nucleotide&rettype=fasta&retmode=text'
    try:
        FH = urllib2.urlopen(s)
        data = FH.read()
    except:
        raise ValueError('URL Error')
    if 'NCBI C++' in data:
        raise ValueError('Empty')
    elif not data:
        raise ValueError('Empty')
    return data

def run(FH):
    first = 894465
    #first = 895465
    last = 895588
    L = ['HQ' + str(n) for n in range(first,last+1)]
    rL = list()
    L = chunks(L,SZ=40)
    L = [(e,1) for e in L]
    while L:
        sL,n = L.pop(0)
        print sL[0], sL[-1], n
        if n > 3:
            continue
        try:
            s = fetch(','.join(sL))
            FH.write(s)
        except ValueError as e:
            print e
            L.append((sL,n+1))
        time.sleep(10)

if __name__ == '__main__':
    FH = open('results.txt','w')
    L = run(FH)
    FH.close()
```

## Tuesday, March 8, 2011

### 16S rRNA V regions, continued

As promised last time (here), this is the code to plot V regions from the sequences in the Enterobacteriales, obtained from RDP. If you compare to this post from the other day, you'll see we match pretty well.

The figure looks better with a wider window, but the limits are established more accurately with a smaller window. The threshold T needs to be adjusted based on the window size.

One detail: I did the redirect below to save the extreme values and then parsed them with the code at the end of the post.

 `python script.py > extreme_values.txt`

`script.py`

```
import utils as ut
from info import shannon
import matplotlib.pyplot as plt

data = ut.load_data('entero.txt')
data = data.strip().split('>')[1:]
EC = [e for e in data if 'X80725' in e][0]
EC = ut.clean_fasta(EC)[1]
data = [ut.clean_fasta(e)[1] for e in data]
L = ut.make_count_list(data)

pos = 0
R = range(1,1451)
iL = list()
for i,c in enumerate(EC):
    if c == '-':
        continue
    pos += 1
    if not pos in R:
        continue
    cD = L[i]
    e = shannon(cD,'ACGT')
    iL.append(e)
    #print str(pos).ljust(3), c + ' ',
    #print ''.join([str(cD[k]).ljust(5) for k in 'ACGT']),
    #print round(e,2)

aL = list()
w = 20
T = 1.8
for i in range(len(iL)):
    j, k = i - w, i + w + 1
    if j < 0:
        j = 0
    if k > len(iL):
        k = len(iL)
    m = ut.mean(iL[j:k])
    if m < T:
        print i+1
    aL.append(m)

plt.plot((1,1451),(T,T),lw=2,color='r',zorder=0)
plt.scatter(R,aL)
ax = plt.axes()
ax.set_xlim(-5,1455)
ax.set_ylim(0.8,2.05)
plt.savefig('example.pdf')
```
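The `shannon` function imported from my `info` module isn't shown in the post; a minimal version consistent with how the script calls it (a dict of base counts and the alphabet `'ACGT'`) would be:

```python
from math import log

def shannon(cD, symbols):
    # Shannon entropy in bits over the given symbols;
    # cD maps each base to its count at this alignment column.
    counts = [cD.get(c, 0) for c in symbols]
    total = float(sum(counts))
    H = 0.0
    for n in counts:
        if n:
            p = n / total
            H -= p * log(p, 2)
    return H
```

A perfectly conserved column gives 0 bits; an even split over four bases gives the maximum of 2 bits, which is why the plot's y-axis tops out just above 2.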

`analyze.py`

 `import utils as utdata = ut.load_data('extreme_values.txt')L = [int(n) for n in data.strip().split('\n')]current = L.pop(0)print current, '-',while L: next = L.pop(0) if next != current + 1: print current print next, '-', current = nextprint next`