On "overfitted" models:
J. Bertrand said it this way: 'Give me four parameters and I shall describe an elephant; with five, it will wave its trunk.'
--L. Le Cam
The first page is here, but it's behind a paywall, sorry :( Do the JSTOR thing if you can.
If a man is offered a fact which goes against his instincts, he will scrutinize it closely, and unless the evidence is overwhelming, he will refuse to believe it. If, on the other hand, he is offered something which affords a reason for acting in accordance to his instincts, he will accept it even on the slightest evidence.
--B. Russell
We run the Phylip program protdist as a Unix process, as described previously (here). The first try failed with:
ERROR: end-of-line or end-of-file in the middle of species name for species 2
The culprit was a FASTA-style header in what should have been a Phylip-format input file:
>Ecoli
steps = [ 0.97, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4 ]
The chimpanzee curve has a peak whose height is 67 percent as great as that given by HSA.
human 1.0
chimp 1.14
gorilla 1.09
old world monkeys 2.23-2.65
new world monkeys 4.2-5.0 *
prosimians 7.6-18
* one exception
Hominoids:
>>> FH = open('extra.txt','r')
$ muscle -in homo.pan.mito.txt -out align.txt
>>> FH = open('align.txt','r')
('-', 'A') 6
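To count the pairs, zip the two aligned sequences together and tally the columns. Here is a minimal sketch (the FASTA-reading helper and the two-record layout of align.txt are my assumptions, not the original script):

from collections import Counter

def read_fasta(fn):
    # collect the sequence lines under each '>' header
    seqs = []
    for line in open(fn):
        line = line.strip()
        if line.startswith('>'):
            seqs.append([])
        elif line:
            seqs[-1].append(line)
    return [''.join(s) for s in seqs]

seq1, seq2 = read_fasta('align.txt')
# each aligned column is a (char1, char2) tuple
counts = Counter(zip(seq1, seq2))
for pair in sorted(counts):
    print(pair, counts[pair])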
The plot was made with scatter, as we've done many times before, and I added the labels by hand. The values are log-scaled to make the smaller ones more visible. The file is on Dropbox (here). The other issue is that the ideas are complex. Now, if you step back and spend some serious time thinking about it, the kernel of the ideas is clearly not complex. The essence of an idea is what we try to work with most. As a professor with my graduate students, we would read a paper and say, "What's the essence of the idea? What's something new? Forget about the equations. Forget about the complicated argument and try to identify the kernel of the idea."
0 ['A', 'B', '1']
I added the script (nni.py) to the zipped files for the project. It's been fun, but I probably won't do the others for a while, if at all. I'm ready for new challenges.
example_tree = { '0':['A','B','1'], '1':['C','0','3'],
                 '2':['E','D','3'], '3':['F','2','1'] }
A:0, C:1, B:0, E:2, D:2, F:3, 1:root, 0:1, 3:1, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:1, 0:1, 3:1, 2:3
We build up a repr (representation) for each node, from the farthest tips working up to the root. The method was inspired by PyCogent's ancestors() function, which returns a list of ancestors (naturally enough). The output with debug enabled looks like this:
For a tip, the repr is just 'E'. When we process the first i_node (2), its repr will be (E,D)2. If we're using branch lengths, it will be: (E:2.25,D:2.75)2. The distance data is in a separate dictionary:
dist = { ('A','0'):1.0, ('B','0'):4.0, ('C','1'):2.0,
         ('F','3'):4.75, ('E','2'):2.25, ('D','2'):2.75,
         ('2','3'):0.75, ('3','1'):1.25, ('1','0'):1.0 }
(A:1.00,B:4.00,(C:2.00,(F:4.75,(E:2.25,D:2.75)2:0.75)3:1.25)1:1.00)0;
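The whole Newick string can also be generated with a short recursion. This sketch is my own simplification (it recurses down from the root rather than building up from the tips as the actual code does); the node dictionary is written out in full, with parent pointers oriented toward root 0 as in the string above:

example_tree = { '0':['A','B','1'], '1':['C','3','0'],
                 '3':['F','2','1'], '2':['E','D','3'] }
dist = { ('A','0'):1.0, ('B','0'):4.0, ('C','1'):2.0,
         ('F','3'):4.75, ('E','2'):2.25, ('D','2'):2.75,
         ('2','3'):0.75, ('3','1'):1.25, ('1','0'):1.0 }

def newick(node, parent):
    if node not in example_tree:          # a tip
        return '%s:%.2f' % (node, dist[(node, parent)])
    children = [c for c in example_tree[node] if c != parent]
    inner = ','.join(newick(c, node) for c in children)
    if parent is None:                    # the root carries no branch length
        return '(%s)%s;' % (inner, node)
    return '(%s)%s:%.2f' % (inner, node, dist[(node, parent)])

print(newick('0', None))
# (A:1.00,B:4.00,(C:2.00,(F:4.75,(E:2.25,D:2.75)2:0.75)3:1.25)1:1.00)0;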
The test for these is len(label) > 1.
tree_dict = { '0':['A','B','1'], '1':['C','0','3'],
              '2':['E','D','3'], '3':['F','2','1'] }
['1', 'C', '0', 'A', 'B', '3', 'F', '2', 'E', 'D']
A:0, C:1, B:0, E:2, D:2, F:3, 1:root, 0:1, 3:1, 2:3
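Both the traversal order and the child:parent map fall out of the dictionary directly. A sketch (the preorder helper is mine; the [child, child, parent] layout, with the root listing three children, follows the convention above):

tree_dict = { '0':['A','B','1'], '1':['C','0','3'],
              '2':['E','D','3'], '3':['F','2','1'] }
root = '1'

def preorder(node, parent=None):
    # visit a node, then each child subtree in list order
    result = [node]
    for child in tree_dict.get(node, []):
        if child != parent:
            result.extend(preorder(child, node))
    return result

print(preorder(root))
# ['1', 'C', '0', 'A', 'B', '3', 'F', '2', 'E', 'D']

parent = { root:'root' }
for node, entry in tree_dict.items():
    children = entry if node == root else entry[:-1]
    for child in children:
        parent[child] = node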
utils contains flatten, from here.
import utils
import sys, random, string
edge.0 A:1.0;
Between And Length
> tr$edge.length
0 : B 4.000 A 1.000
Outgroup root? No, use as outgroup species 1
We run neighbor as a process (just like here). One wrinkle: the process returns faster than the file write from Phylip, so we have to wait a second before trying to do anything with the files ourselves.
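A sketch of the workaround (the menu responses fed on stdin are placeholders; see the linked post for the real ones):

import subprocess, time

p = subprocess.Popen('neighbor', stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, text=True)
p.communicate(input='infile\nY\n')     # hypothetical menu responses

# the wrinkle: give Phylip a moment to finish writing outfile and outtree
time.sleep(1)
tree = open('outtree').read()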
The tree is then plotted with ape.
import os, subprocess, time, sys, random
$ port search pdftk
pdftk file1.pdf file2.pdf cat output outfile.pdf |
pdftk 0.pdf 1.pdf 2.pdf 3.pdf 4.pdf 5.pdf 6.pdf 7.pdf 8.pdf 9.pdf 10.pdf 11.pdf 12.pdf 13.pdf 14.pdf 15.pdf 16.pdf cat output Ch2.1.pdf
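Typing all seventeen filenames is tedious; the same command can be assembled in Python (a sketch, using the file range and output name from above):

import subprocess

files = ['%d.pdf' % i for i in range(17)]
cmd = ['pdftk'] + files + ['cat', 'output', 'Ch2.1.pdf']
subprocess.call(cmd)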
On Blogger, I've put up 500 posts over about 2 1/2 years. That's a lot of posts. I've still got much more exploring to do, so I plan to continue talking about it.

The problem turned out to be in both upgma and the plotting code! So I guess the moral is: test, test, and test again.
0 32 48 51 50 48 98 148
All of this is printed when debug == True, as well as all the branch lengths.
I plotted the tree with the ape package from R. I updated the zipped project files which are on Dropbox (here). The plot is above; it looks good to me.

A B C D E
r(A) = 30 = 5 + 4 + 7 + 6 + 8
Each entry d(ij) is adjusted by subtracting (ri + rj)/(N-2), where N is the number of OTUs. For example, the new value for entry 'AB' is: 5 - ((30 + 42)/4) = -13, and that is the smallest value in the whole matrix. The branch length from A to the new node U is:
S(AU) = d(AB)/2 + [r(A) - r(B)] / (2(N-2))
d(UC) = ( d(AC) + d(BC) - d(AB) ) / 2 = 3
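Here are those three calculations in Python, using only numbers quoted above (N = 6 for the six OTUs A through F; d(AC) = 4 comes from the first matrix row, and d(BC) = 7 follows from the d(UC) result):

N = 6
d_AB, r_A, r_B = 5, 30, 42
adjusted = d_AB - (r_A + r_B)/(N - 2)       # 5 - 72/4 = -13.0
S_AU = d_AB/2 + (r_A - r_B)/(2*(N - 2))     # 2.5 - 1.5 = 1.0
d_AC, d_BC = 4, 7
d_UC = (d_AC + d_BC - d_AB)/2               # (4 + 7 - 5)/2 = 3.0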
U C D E
m = c(0,5,4,7,6,8,
The tree was rooted with ape.root.
(((('A', 'B'), 'C'), ('D', 'E')), 'F')
A {'up': 1.0, 'parent': 'AB'}
C {'up': 2.0, 'parent': 'ABC'}
B {'up': 1.0, 'parent': 'AB'}
E {'up': 2.0, 'parent': 'DE'}
D {'up': 2.0, 'parent': 'DE'}
F {'up': 4.0, 'parent': 'ABCDEF'}
DE {'to_tips': 2.0, 'right': 'E', 'up': 1.0, 'parent': 'ABCDE', 'left': 'D'}
ABCDE {'to_tips': 3.0, 'right': 'DE', 'up': 1.0, 'parent': 'ABCDEF', 'left': 'ABC'}
ABC {'to_tips': 2.0, 'right': 'C', 'up': 1.0, 'parent': 'ABCDE', 'left': 'AB'}
AB {'to_tips': 1.0, 'right': 'B', 'up': 1.0, 'parent': 'ABC', 'left': 'A'}
ABCDEF {'to_tips': 4.0, 'right': 'F', 'left': 'ABCDE'}
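A nice check on this structure: in a UPGMA tree, every tip should end up the same distance from the root. This sketch walks each tip's parent chain, summing the 'up' lengths (the dictionary is the one printed above, trimmed to the fields used):

nodes = {
  'A':{'up':1.0, 'parent':'AB'},     'B':{'up':1.0, 'parent':'AB'},
  'C':{'up':2.0, 'parent':'ABC'},    'D':{'up':2.0, 'parent':'DE'},
  'E':{'up':2.0, 'parent':'DE'},     'F':{'up':4.0, 'parent':'ABCDEF'},
  'AB':{'up':1.0, 'parent':'ABC'},   'ABC':{'up':1.0, 'parent':'ABCDE'},
  'DE':{'up':1.0, 'parent':'ABCDE'}, 'ABCDE':{'up':1.0, 'parent':'ABCDEF'},
  'ABCDEF':{} }

for tip in 'ABCDEF':
    total, node = 0.0, tip
    while 'parent' in nodes[node]:
        total += nodes[node]['up']
        node = nodes[node]['parent']
    print(tip, total)       # each tip is 4.0 from the root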
import sys, random
# convert to R 'matrix' and cluster
We use flatten from here before feeding to R, and this going the other way:
def to_array(M,R,C):
    return [M[i*C:(i+1)*C] for i in range(R)]
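A quick round trip shows the two helpers are inverses (this flatten is a stand-in for the linked one):

def flatten(M):
    return [x for row in M for x in row]

M = [[0,5,4], [5,0,7], [4,7,0]]
L = flatten(M)              # [0, 5, 4, 5, 0, 7, 4, 7, 0]
assert to_array(L, 3, 3) == M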
hclust does a reasonable job of finding the structure we created. Rather than list the code, I put a zipped folder with the files on Dropbox here.
The script is likelihood.py, and it defines three functions for processing nodes, depending on the nature of their child nodes (both external, both internal, or mixed). The code is full of if debug: print .. statements, where I've been rigid about putting the print on an indented line below the if. (I've stripped all of those out in the listing below.) I went through the printout trying to verify the results of each calculation manually, and it's pretty exhausting. What would be nice is to have another ML program to test the output against, but without implementing the branch lengths properly, I'm not sure I can do that.
data = tu.load_data(fn)
likelihood.py:
import math
For a branch of length t, the model gives transition probabilities between states. The probabilities of no change are:
PAA, PCC..
and the probabilities of change are:
PAC, PAG, PAT..
Consider an internal node X whose two children are tips with observed states A and G, at distances t1 and t2. X can have any of the states {A,C,G,T}. Starting with A, we can calculate the probabilities for the two branches as PAA(t1) and PAG(t2), so:
L(X=A) = PAA(t1) PAG(t2)
If X is C instead:
L(X=C) = PCA(t1) PCG(t2)
The total over all states N in {A,C,G,T} is the sum:
Σ PNA(t1) PNG(t2)
Now move up one level. Suppose X's parent Y has state A, with a branch of length t3 between them. If X is C, the term is:
PAC(t3) PCA(t1) PCG(t2)
which is just:
PAC(t3) L(X=C)
Both Y's state M and X's state N range over {A,C,G,T}. Then there are 16 terms like:
PMN(t3) PNA(t1) PNG(t2)
Y also has a tip child with state G at distance t4, which contributes a factor as described in the first part:
PMG(t4)
so each of the 16 terms becomes:
PMG(t4) PMN(t3) PNA(t1) PNG(t2)
Finally, consider the root W, with Y at distance t5 and a second subtree Z below it. Again, each node can take any state in {A,C,G,T}, and we calculate probabilities as follows:
P(W=A,Y=A) = PAA(t5)
The contribution from the Y side is then:
P(W=A,Y=A) L(Y=A)
and taking the Z side together with it, the term for W=A, Y=A, Z=A is:
P(W=A,Y=A) L(Y=A) P(W=A,Z=A) L(Z=A)
L(W=A,Y={A,C,G,T},Z={A,C,G,T}) is the sum of terms like:
P(W=A,Y=A) L(Y=A) P(W=A,Z=A) L(Z=A)
that is:
Σ Σ P(W=A,Y=M) L(Y=M) P(W=A,Z=N) L(Z=N)
To finish, we weight each possible root state N by its equilibrium frequency πN and sum:
Σ πN L(W=N)
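To make the bookkeeping concrete, here is a sketch of the whole single-site calculation as a recursion (Felsenstein's pruning algorithm). The Jukes-Cantor model and the branch lengths are my assumptions, and Z is taken to be a single tip with state A; the tree is otherwise the one described above:

import math

def P(a, b, t):
    # Jukes-Cantor transition probability P_ab(t)
    e = math.exp(-4.0*t/3)
    return 0.25 + 0.75*e if a == b else 0.25 - 0.25*e

bases = 'ACGT'

def L(node):
    # partial likelihoods: state -> probability of the data below node
    if isinstance(node, str):                 # a tip with an observed state
        return dict((b, 1.0 if b == node else 0.0) for b in bases)
    (c1, t1), (c2, t2) = node
    L1, L2 = L(c1), L(c2)
    return dict((m, sum(P(m, n, t1)*L1[n] for n in bases) *
                    sum(P(m, n, t2)*L2[n] for n in bases))
                for m in bases)

X = (('A', 0.1), ('G', 0.2))      # tips A and G at t1, t2
Y = ((X, 0.1), ('G', 0.15))       # X at t3, the G tip at t4
W = ((Y, 0.3), ('A', 0.25))       # Y at t5, Z as a single A tip

site_L = sum(0.25 * like for like in L(W).values())   # piN = 1/4 under JC
print(math.log(site_L))

Summing math.log(site_L) over all sites gives log Ltot, as in the formula below.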
The likelihood is calculated in this way for each site, and sites are assumed to evolve independently, so the likelihood for the complete sequence set, Ltot, is the product of the site likelihoods. It follows that the log of the likelihood is the sum of the logs of the site likelihoods.
log Ltot = Σ {sites} log Lsite