Python for Bioinformatics: March 2010

Tuesday, March 30, 2010

Genes not patentable?

In the news (New York Times) is a ruling in a case before the US District Court for the Southern District of New York (Judge Robert W. Sweet) that invalidates Myriad Genetics' patent on the BRCA1 and BRCA2 genes (article, opinion).

For everyone I know (scientists), the idea of patenting genes---particularly unmodified genes---is ludicrous. It exemplifies the nature of our age, which we might call capitalism gone wild. On the other hand, the first lawyer I talked to (who happens to be my brother) explained to me "that's the way the world works." (I'm paraphrasing). Money talks, and all that.

Here is an excerpt of the article that gives the main themes:

“It’s really quite a dramatic holding that would have the effect of invalidating many, many patents on which the biotechnology industry has invested considerable money,” said Rebecca S. Eisenberg, a professor of law at the University of Michigan who has written extensively on gene patents.

The Genomics Law Report, an Internet journal, called the decision “radical and astonishing in its sweep.” It headlined its article, “Pigs Fly.”

Although patents are not granted on things found in nature, the DNA being patented had long been considered a chemical that was isolated from, and different from, what was found in nature.

But Judge Sweet ruled that the distinguishing feature of DNA is its information content, its conveyance of the genetic code. And in that regard, he wrote, the isolated DNA “is not markedly different from native DNA as it exists in nature.”

Executive summary:

P1 says that a lot of money is at stake (implies patents are good)
P2 expresses incredulity: "Pigs Fly"
P3 summarizes Myriad's core scientific proposition

To be clear, the argument (which is complete and total bullshit) is that the DNA in cells is different than that which has been purified away from histones and other stuff, or cloned in bacteria, or amplified by PCR. It's been "transformed," presto, and is now something different enough to be patentable. Trust me, it's a stretch. Particularly to a geneticist who has spent his whole career thinking about genes as sequences. In my view, today is the day the Emperor discovered he is naked.

P4 Well, I include paragraph 4 because it illustrates for me how silly it is for the New York Times to claim to be the "paper of record." Some time ago, I emailed them to explain that it is not correct to talk about sequencing a genome as "decoding" it, or explicating its "genetic code." I wish I still had the message I received in reply. They explained that they were not interested in one scientist's or even consensus opinion about English usage. The editors felt strongly that their readers would not be comfortable with "determined the DNA sequence of X" and would like the sound of "decoded X." Accuracy be damned.

So there you have it. If you have time, read the decision. Yet another example of the intelligence to be seen on the U.S. District Courts of the nation.

Idiots

They call themselves the "paper of record":

FAA, NTSB Investigate Near Mid-Air Crash Over SF
By THE ASSOCIATED PRESS
Published: March 30, 2010

Filed at 8:21 p.m. ET

SAN FRANCISCO (AP) -- Federal investigators are looking into the weekend near collision of a commercial jet and small airplane that came within 300 feet of each other over San Francisco.
...

The closest the planes came to each other was 300 feet vertically and 1,500 feet horizontally, Gregor said.

(from the New York Times)

>>> math.sqrt(300**2 + 1500**2)
1529.7058540778355

Poisson and airline reservations

In the problems for Chapter 5.1, Grinstead and Snell have the following:

An airline finds that 4 percent of the passengers that make reservations on a particular flight will not show up. Consequently, their policy is to sell 100 reserved seats on a plane that has only 98 seats. Find the probability that every person who shows up for the flight will find a seat available.

Attempt #1:
Each seat that is sold is a Bernoulli trial with Prob(no-show) = 0.04.

N = 100, so the expected number of no-shows for a flight = N p = 4.
The variance is N p (1-p) = 3.84, sd = 1.95

If the number of no-shows exceeds ≈ 1 sd, there won't be a seat for someone.
I expect this to happen about 0.5*e^-1 of the time = 0.18.
The desired probability is 1 minus this.

The problem with this is that no-shows are not normally distributed!

Attempt #2:

If we use the Poisson approximation we have

P = λ^k / k! e^-λ
lambda = 4
P(0 no-show) = e^-4 = 0.018
P(1 no-show) = 4 * 0.018 = 0.073

P(nobody gets bumped) = 1 - 0.073 - 0.018 = 1 - 0.091 = 0.919

Attempt #3 is a simulation (Python code below.
Note the number of seats simulated is 10⁸ !).
The program prints the probability that no one gets bumped, and the st dev for the simulated values.

$ python problem28.py 
0.087023 0.00877316767194

import numpy as np
import random

def overbooked():
    rL = list()
    for i in range(100):
        if random.random() > 0.96:  pass
        else: rL.append(1)
    return sum(rL) > 98

N = 1000
L = list()
for i in range(1000):
    S = sum([overbooked() for i in range(N)])
    L.append(S*1.0/N)
A = np.array(L)
print np.mean(A), np.std(A)

Clustering with UPGMA

While there is still much more to talk about with respect to models of sequence of evolution, I'm going to introduce a new topic in phylogenetics today: distance methods of building a tree. The classic example of this was "popularized" by Sokal and Sneath (according to Felsenstein, p. 161). It's called UPGMA (unweighted pair-group method with arithmetic mean). This is a clustering algorithm that uses an average linkage method, and when applied to a set of distances between objects it gives a rooted tree.

Suppose we want to compute the distance between two clusters of objects (ABC and DE). We tally the distances between each pair of items (AD, AE, BE, BE, CD, CE) and average them.

To show the method in practice, let's start with the answer---that usually makes things a lot easier. In the tree at the top, we have six species, OTUs, or just objects listed as A through F in the figure, and also the indicated tree showing the paths by which the OTUs have diverged from each other. Distance on the tree is indicated as different chunks of a unit distance. We derive the following table of distances:

     A    B    C    D    E    F
A    -
B    2    -
C    4    4    -
D    6    6    6    -
E    6    6    6    4    -
F    8    8    8    8    8    -

The method is simply to choose the pair of nodes that are closest, and average them together to construct a new node, and then compute the distance of the new node from all the others. We remove the original nodes from the table.

Here, A and B are closest, so we'll form the new node AB from them. (We remember the distance from A to B was 2). The distance from AB to the other nodes is the average of the distances for A and B. This is the same for all, so we obtain:

    AB    C    D    E    F
AB   -
C    4    -
D    6    6    -
E    6    6    4    -
F    8    8    8    8    -

We can build the tree corresponding to the first two lines of what we just accomplished (AB,C):

For the next step, we'll break a tie by deciding that D and E are closest, so we collapse them into a single node (remembering that the distance from D to E was 4).

    AB    C   DE    F
AB   -
C    4    -
DE   6    6    -
F    8    8    8    -

Now, we need to combine AB and C. For this case, we need to weight the distances computed for the new node using the arithmetic mean of the distance to each component (ABC to DE). However, in this example, it doesn't matter, since they are all the same.

     ABC  DE   F
ABC   -
DE    6    -
F     8    8   -

Finally, we group ABC with DE.

     ABCDE  F
ABCDE  -
F      8    -

The method assumes a molecular clock (that the true distances between the OTUs are proportional to the observed distances between them), which is always the case for a simple clustering application, and may be a reasonable assumption for closely related species.

Felsenstein has a nice (somewhat more complicated) example based on immunological distances (Sarich 1969 PMID 5373919). The reference is from him---as for many older papers, our crappy library does not provide access, so I have not read it. An important reason to publish in "free" journals.

In the code below, the plot of hclust isn't quite working as I'd like. So, instead (looking ahead) I'll show the neighbor-joining tree from this data.

That results in an unrooted tree.

Here is the R code:

m = c(0,2,4,6,6,8,
      2,0,4,6,6,8,
      4,4,0,6,6,8,
      6,6,6,0,4,8,
      6,6,6,4,0,8,
      8,8,8,8,8,0)
dim(m)=c(6,6)
colnames(m)=c(
  'A','B','C','D','E','F')
rownames(m)=colnames(m)
d = as.dist(m)

plot(hclust(d))

library(ape)
tr=nj(d)
colors=c('blue','blue','darkgreen',
  'red','red','maroon')
plot(tr,cex=2,
  tip.color=colors)
axisPhylo()

Duly quoted

One day the parts of the body argued about who ought to be the boss.

The brain said, “I do the thinking around here, so I should be in charge.”
The hands said, “We do all the work around here, so we ought to be in charge.”
The feet said, “If we didn’t carry the body around, nothing could get done, so we ought to be in charge.”

Every part of the body made its claim, until finally the rectum piped up, “I want to be in charge!”

read the rest
--Mark Kleiman

Jukes-Cantor (8) - using the Poisson distribution

In a previous post, I showed a derivation of the equations (from Higgs & Attwood) describing the probability of change at a sequence position as a function of time in the Jukes-Cantor model.

P_XX(t) = 1/4 + 3/4*e^-4*α*t
P_XY(t) = 1/4 - 1/4*e^-4*α*t

These were obtained by starting with this model

  t            Δt
A => [A,C,G,T] => A

and then deriving and solving the resulting differential equation. This seems to be a pretty standard treatment.

But in Chapter 11 of Felsenstein, I found an alternative approach based on the Poisson distribution, which I think is quite wonderful.

To begin, he defines rates in terms of u (= 3*λ), so the individual rates for changes from X to Y are u/3. He says:

The easiest way to compute this is to slightly fictionalize the model. Instead of having a rate u of change to one of the three other bases, let us imagine that we instead have a rate 4/3*u of change to a base randomly drawn from all four possibilities. This will be exactly the same process, as there then works out to be a probability of u/3 of change to each of the other three bases. We have also added a rate u/3 of change from a base to itself, which does not matter.

Now we use the Poisson distribution.

P(k) = λ^k / k! * e^-λ
P(0) = e^-λ

For a branch with time t, the

probability of no events at all at a site, when the number expected to occur is 4/3*u*t, is the zero term of the Poisson distribution…the branch of time t consists then of a vast number of tiny segments of time dt, each having the small probability of 4/3*u*dt of an event.

The probability of at least one event is 1 minus the zero term:

1 - e^-4/3*u*t )

The probability that it is a particular event (say, A to C) is:

1/4*(1 - e^-4/3*u*t )

The probability of change to any one of the three nucleotides is then:

3/4*(1 - e^-4/3*u*t )

Kyte-Doolittle hydrophobicity plot in Python

I had a request the other day to compute a hydrophobicity plot for a protein. The idea (Kyte & Doolittle, PMID 7108955) is to assign a "hydrophobicity" value to each amino acid. We slide a window along a protein and sum the values in the window, then plot the result. The plot shown above suggests the presence of six long hydrophobic stretches in this protein, which could span the membrane and thus mark a transition from "inside the cell" to "outside the cell" (or the reverse).

A   1.8 C   2.5 D  -3.5 E  -3.5
F   2.8 G  -0.4 H  -3.2 I   4.5
K  -3.9 L   3.8 M   1.9 N  -3.5
P  -1.6 Q  -3.5 R  -4.5 S  -0.8
T  -0.7 V   4.2 W  -0.9 Y  -1.3

I found a page at ExPASy that will do this.

It's pretty sophisticated, with 50 or so possible classifications (e.g. alpha-helix and beta-sheet). Since it's a fun (and simple) Python problem, I downloaded their data, as shown above. Most of the code (below) involves parsing the file with the data into a Python dictionary, and reading the file with the protein sequence. The code to scan the sequence is trivial.

As an example, I chose a protein analyzed by my friend Dana Boyd and colleagues, E. coli malG (Boyd 1993 PMID 8419303). They use alkaline phosphatase fusions to check the predictions of the topology model, as shown in this figure from the paper.

The idea with the fusions is that active alkaline phosphatase requires disulfide bonds, which only form in the periplasm (outside the cytoplasm). If we attach (fuse) alkaline phosphatase coding sequence at the points indicated by the arrows, we predict the fusions made at points predicted to be external to the cytoplasm will have high activity, while the internal ones will have low activity. As you can see, the results fit the model.

My plot using matplotlib is shown above.

import string
import numpy as np
import matplotlib.pyplot as plt

D = {'Ala':'A', 'Arg':'R', 'Asn':'N', 'Asp':'D',
     'Cys':'C', 'Gln':'Q', 'Glu':'E', 'Gly':'G',
     'His':'H', 'Ile':'I', 'Leu':'L', 'Lys':'K',
     'Met':'M', 'Phe':'F', 'Pro':'P', 'Ser':'S',
     'Thr':'T', 'Trp':'W', 'Tyr':'Y', 'Val':'V'}
     
fn = 'hydro.data.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = data.strip().split('\n')

HD = dict()
for e in L:
    aa, rest = e.split(':')
    rest = rest.strip()
    n = rest[rest.index('.')-1:]
    n = float(n)
    if '-' in rest:  n *= -1
    HD[D[aa]] = n
    
for i,k in enumerate(sorted(HD.keys())):
    if i and not i % 4:  print 
    print k, str(HD[k]).rjust(5),
print
#=============================================
fn = 'malG.seq.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = [c for c in data if c in string.lowercase]
prot = ''.join(L).upper()
#print prot[:50]
#=============================================
window = 15
delta = 2
rL = list()
for i in range(0,len(prot)-window+1,delta):
    pep = prot[i:i+window]
    L = [HD[aa] for aa in pep]
    rL.append((i, pep, sum(L)))

for e in rL:
    if e[2] > 20:  print e
X = np.arange(len(rL)) * delta
Y = [e[2] for e in rL]

plt.scatter(X,Y,s=40,color='r',zorder=2)
plt.plot(X,Y,'--',lw=1.5,color='k',zorder=1)
plt.savefig('example.png')

Friday, March 26, 2010

Just showin' off

This is a meta-post, a post about the blog. I started nearly two years ago, and I'm still having fun (after more than 400 posts!). I sincerely hope someone else out there in the ether is enjoying it too, or will some day in the future, though I confess to wondering about this from time to time. I have written some posts explicitly for my son, but he doesn't seem to be ready yet.

I appreciate that the eclectic choice of topics is likely to be a distraction for people, but I am pretty impulsive by nature---and I thoroughly enjoy my lack of discipline.

The blog is a way for me to try to record the little bit of understanding that I have gained on some point of interest (usually very recently). I am also stockpiling posts on topics that I plan to provide for students of my bioinformatics class, if the day ever comes when I teach that class again. I guess the main point I'm trying to make now is that I'm enjoying myself, but often feel like I'm broadcasting in a vacuum.

So I was quite surprised to find a very generous comment from Joe Felsenstein here the other day. In fact I went to find my wife so I could brag to her about it. It made me incredibly happy. But I also felt guilty, because I had made a thoughtless negative comment in the post, for which I need to atone.

So here goes...do you feel like Metropolis-Hastings MCMC has the feel of something really, really important, but you don't quite understand it? Me too. While I don't feel totally confident about it yet, I did achieve a kind of mathematical satori about this while reading Chapter 18 in Inferring Phylogenies. For the serious student, his book is the place to start. On this and many other topics. Thank you, Professor Felsenstein.

Wednesday, March 24, 2010

Arrows in matplotlib

The last example would have looked better with arrows. Here is a first attempt. The getArrow function is complicated a bit by my wanting to adjust the start and stop positions of the arrows so that they don't overlap the points. The commented print statements were useful in debugging that function. (It took me a lot longer than I care to admit).

I have a small problem with the arrows plotting after the points. The z-order for an arrow is not respected by plt. So the arrows plot over the points, and I haven't figured out how to fix that yet.

import numpy as np
import matplotlib.pyplot as plt

def getArrow(p1,p2,i):
    # we need to subtract some from each end
    # slope = m
    w = p2.x - p1.x
    h = p2.y - p1.y
    #print p1.x,p1.y
    #print p2.x,p2.y
    #print 'w',w,'h',h
    
    dr = 0.03
    if w == 0:
        dy = dr
        dx = 0
    else:
        theta = np.arctan(np.abs(h/w))
        dx = dr*np.cos(theta)
        dy = dr*np.sin(theta)
    #print 'dx',dx,'dy',dy
    
    if w < 0:  dx *= -1
    if h < 0:  dy *= -1
    #print 'dx',dx,'dy',dy    
    w -= 2*dx
    h -= 2*dy
    #print 'w',w,'h',h
    x = p1.x + dx
    y = p1.y + dy
    #print 'x',x,'y',y
    
    a = plt.Arrow(x,y,w,h,
                  width=0.05,zorder=i+1)
    a.set_facecolor('0.7')
    a.set_edgecolor('w')
    return a

class Point:  pass

N = 10
L = np.random.uniform(size=N*2)
pL = list()
for i in range(0,N*2,2):
    p = Point()
    p.x,p.y = L[i],L[i+1]
    pL.append(p)
    
ax = plt.axes()
for i,p in enumerate(pL):  
    if i:
        a = getArrow(pL[i-1],p,i)
        ax.add_patch(a)
    plt.scatter(p.x,p.y,s=250,zorder=1)

ax.set_xlim(-0.01,1.01)
ax.set_ylim(-0.01,1.01)
plt.savefig('example.png')

Branch and Bound (2)

Following up on the post from last time, which introduces a Python simulation of Branch and Bound for the problem shown above: the shortest path between points in 2D space.

Here is a bigger problem (N = 10). The points are:

0.142  0.748  0.38   0.98   0.547  0.498  0.834  0.283  0.242  0.916 
0.848  0.432  0.109  0.474  0.828  0.964  0.066  0.574  0.274  0.839

The output is:

N =  10
3628800 possibilities
i =       0 new smallest = 5.52
i =       2 new smallest = 5.205
i =       8 new smallest = 4.783
i =      19 new smallest = 4.774
i =      21 new smallest = 4.683
i =      99 new smallest = 4.666
i =     300 new smallest = 4.625
i =     324 new smallest = 4.238
i =     603 new smallest = 4.186
i =     627 new smallest = 4.174
i =    2244 new smallest = 4.128
i =    2256 new smallest = 3.746
i =    2262 new smallest = 3.734
i =    4114 new smallest = 3.495
i =    7284 new smallest = 3.326
i =    9520 new smallest = 3.274
i =    9640 new smallest = 3.262
i =   74468 new smallest = 3.226
i =  139466 new smallest = 3.185
i =  139476 new smallest = 2.952
i =  140647 new smallest = 2.943
i =  179796 new smallest = 2.842
i =  273364 new smallest = 2.778
i =  480496 new smallest = 2.776
i = 1038668 new smallest = 2.697

I recorded the iteration and value at which we obtained a new shortest path. At the bottom, the results are compared for a total enumeration (first) and then the branch and bound result. The timings indicate that the simulation is quite slow, and also that the algorithm isn't helping us much. I would bet that the reason is we don't abandon a putative solution until rather late in the calculation. Because the distances are too regular, our saved minimum value is not exceeded until close to the end.

min
2 8 7 0 5 4 9 3 1 6  
2.697
58.8522 sec
2 8 7 0 5 4 9 3 1 6  
2.697
44.89 sec

When I find a minute, I will modify the script to record when we bail from each calculation, and plot those as a histogram.

Also, it seems clear that this approach can only help us by some constant fraction of the total time. It doesn't really improve the crummy performance as N gets larger.

Code:

def try_all(N,D):
    smallest = N*1
    retval = None
    for i,L in enumerate(permutations(range(N))):
        dist = total_distance(L,D)
        if dist < smallest:
            smallest = dist
            print 'i =', str(i).rjust(7),
            print 'new smallest =', round(smallest,3)
            retval = repr(L,dist)
    return retval      

def branch_and_bound(N,D):
    smallest = N*1
    retval = None
    for L in permutations(range(N)):
        dist = total_distance(L,D,n=smallest)
        if not dist:  continue
        if dist < smallest:
            smallest = dist
            retval = repr(L,dist)
    return retval

def partTwo():  
    N = 10
    A,D = init(N)
    plot(A,D)
    print 'N = ', N
    print math.factorial(N), 'possibilities'
    then = time.time()
    retval = try_all(N,D)
    now = time.time()
    print
    print 'min'
    print retval
    print round(now-then,4), 'sec'
    then = now
    retval = branch_and_bound(N,D)
    now = time.time()
    print retval
    print round(now-then,2), 'sec'
    
if __name__ == '__main__':
    #partOne()
    partTwo()

Branch and Bound

As we explored in the last post, one of the big difficulties in phylogenetic analysis is that the number of possible trees is enormous. The various ways of approaching the problem of exploring a huge space of possibilities, too big to enumerate, is a large part of phylogenetic analysis. Felsenstein discusses one particular approach in Chapter 5---branch and bound.

This is a general technique for solving problems like the Traveling Salesman Problem (TSP), or in our case, evaluation of the maximum parsimony tree.

A simple version of TSP is the following. We pick N points in 2D space. We want to find a path that visits each point once, and covers the shortest distance. If there are N points, then there are N! possible paths.

We decide to generate all the permutations of the N points, and we have evaluated a few of these putative solutions. We keep a record of the path with the shortest distance seen so far.

Now suppose we're evaluating a new path. We're adding up the distances, and somewhere in the middle we discover that the distance for this path is already larger than the minimum. Since the current path cannot be a solution, we should bail immediately.

To explore this I'm going to run simulations using Python. The code below is the first of two parts. Here, we generate 6 x,y points and print their values. The first row is x and the second row is y:

0.142  0.748  0.38   0.98   0.547  0.498 
0.834  0.283  0.242  0.916  0.848  0.432

Since we're going to be using the distances between points repeatedly, we cache their values in a dictionary. Here are its keys and values:

(0, 1) 0.82
(0, 2) 0.64
(0, 3) 0.84
(0, 4) 0.4
(0, 5) 0.54
(1, 2) 0.37
(1, 3) 0.67
(1, 4) 0.6
(1, 5) 0.29
(2, 3) 0.9
(2, 4) 0.63
(2, 5) 0.22
(3, 4) 0.44
(3, 5) 0.68
(4, 5) 0.42

There is a simple function to try one random choice for the path and compute its total distance. We run that 5 times with these results:

2 4 5 1 0 3  
2.998
0 1 2 5 4 3  
2.268
5 2 4 1 0 3  
3.112
5 1 3 4 0 2  
2.446
0 5 2 3 4 1  
2.7

And we plot the points in order to visualize them better, as shown at the top. Next time we'll explore this further.

There is also a function I'm not using yet, that looks at a non-uniform distribution of points. Here's the code:

import time, math, sys
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations
np.random.seed(1357)

def euclid_distance(A,c1,c2):
    x1,y1 = A[:,c1]
    x2,y2 = A[:,c2]
    return np.sqrt((x1-x2)**2 + (y1-y2)**2)
    
def heavy_tails(N):
    L = list()
    for i in range(N):
        f = np.random.random()
        while 0.2 < f < 0.8:
            f = np.random.random()
        L.append(f)
    return L

def init(N=10):
    D = dict()
    if True:
        A = np.random.uniform(size=N*2)
    else:
        A = [np.random.random() for n in range(N)]
        L = heavy_tails(N)
        A = np.array(A + L)
    A.shape = (2,N)
    for L in A:
        L = [str(round(e,3)).ljust(5) for e in L]
        for e in L:  print e + ' ',
        print
    print 
    for c2 in range(A.shape[1]):
        for c1 in range(c2):
            D[(c1,c2)] = euclid_distance(A,c1,c2)
    return A,D
        
def show_distances(A,D):
    for k in sorted(D.keys()):
        print k, str(round(D[k],2))
    print
        
def plot(A,D):
    plt.scatter(A[0,:],A[1,:],s=350,
                facecolor='0.8')
    for c in range(A.shape[1]):  
        plt.text(A[0,c],A[1,c],s=c,
                 fontsize=18,
                 ha='center',va='center')
    ax = plt.axes()
    ax.set_xlim(-0.02,1.02)
    ax.set_ylim(-0.02,1.02)
    ax.grid(True)
    plt.savefig('example.png')
    
def repr(L,dist):
    rL = [' '.join([str(n) for n in L]) + '  ']
    rL.append(str(round(dist,3)))
    return '\n'.join(rL)
    
def try_random(N,D):
    L = range(N)
    np.random.shuffle(L)
    return repr(L,total_distance(L,D))

def total_distance(L,D,n=None):
    d = 0
    for first,next in zip(L[:-1],L[1:]):
        if first < next:
            d += D[(first,next)]
        else:
            d += D[(next,first)]
        if n and d > n:
            return None
    return d

def partOne():
    N = 6
    A,D = init(N)
    show_distances(A,D)
    plot(A,D)
    for i in range(5):  
        result = try_random(N,D)
        print result

partOne()

How many trees?

One of the central facts (and limitations) in phylogenetic analysis is that the number of possible trees is enormous. Felsenstein discusses this issue extensively in Chapter 3. Consider the simplest example, those trees which do not contain directionality (unrooted trees). There is a single tree for 3 species or OTUs (operational taxonomic units). For N = 4, there are three possible trees, as shown above.

There are several ways to see this result for N = 4. One is to just enumerate the possibilities, and recognize that rotation around the central branch does not change the tree. These two trees are the same:

You can think of a tree as a kind of mobile.

Or, we might consider that A has a particular nearest neighbor in the tree, and there are only three choices for this spot. Once that choice is made, the tree is determined.

A third way is to start with the 3 OTU tree and recognize that we can add a new branch at any one of the three existing branches. Again, we can make three trees for N = 4.

We can extend the last idea. Take one particular tree from N = 4. This tree contains five branches, and on any one of these we can add a new OTU. Thus, for each of the three N = 4 trees, we can produce five new ones for N = 5. Each of these new trees is unique, since for each branch we choose, E has a particular nearest neighbor, and in that set of trees there are various unique combinations.

N                   trees      branches
3                      1           3
4      1 x  3          3           5
5      3 x  5         15           7
6     15 x  7        105           9
7    105 x  9        945          11
8    945 x 11     10,395          13

22              ≈ a mole of trees

Can you see the pattern in the table? If we have the N-species tree, the number of new trees we can make for N + 1 is 1 x 3 x 5 x 7 x 9 x ... (2n - 3).

Tuesday, March 23, 2010

Fitch and Sankoff Algorithms for Parsimony

This example comes from Felsenstein's book Inferring Phylogenies (pp.11-16). Consider a tree with the structure and observed data as shown below, for a particular position in a multiple sequence alignment.

"The Fitch algorithm considers the sites (or characters) one at a time. At each tip in the tree, we create a set containing those nucleotides (states) that are observed or are compatible with the observation. Thus, if we see an A, we create the set {A}. If we see an ambiguity such as R, we create the set {AG}. Now we move down the tree [away from the tips]. In algorithmic terms, we do a postorder tree traversal. At each interior node we create a set that is the intersection of sets at the two descendant nodes. However, if that set is empty, we instead create the set that is the union of the two sets at the descendant nodes. Every time we create such a union, we also count one change of state."

In this example then, the terminal nodes are unambiguous. For N2 and N4 we create the unions N2={AC} and N4={AG}. For N3 we create another union N3={ACG}. For the node N1, we can take the intersection of {AC} and {ACG}, which is N1={C}. We created three unions, so there will be three changes of state in the most parsimonious tree.

Sets do not necessarily contain the ancestral states. For this example, we might assign state C to all of the ancestral nodes. Alternatively, we could assign A to all of them. Both trees would require 3 changes of state. But A is not in the set N1={C}.

The Sankoff algorithm allows the incorporation of varying costs for different nucleotide changes. Felsenstein continues the example with a (symmetric) cost matrix. (I have doubled his costs to keep the example all integers).

A    C    G    T
A   0    5    2    5
C   5    0    5    2
G   2    5    0    5
T   5    2    5    0

For each node we set up a small array of values, corresponding to the states (A C G T) in order---shown in boxes in the figure. For the terminal nodes, we assign cost 0 for the correct state, and infinite cost for the others. Since we're not going to use these others, I will just mark them with a '-'.

Now consider node N2. It might be in any of the four states. For each state, we compute the cost of moving down to its (two) terminal nodes. For example, starting from C there would be a cost of 5 (for the C to A change), and no cost for the top branch, which has no change. The same holds for A. For the other two states there would be two penalties, of which one will be 5 and one (the transition) will be only 2. We complete the calculation with the array (5 5 7 7).

For node N4, the calculation is different because the terminal nodes are both purines. Thus, for an initial C or T, there is a total cost of 10, and for an initial A or G, there is a total cost of 2.

For N3, which also has a terminal node as direct descendant, there is again a small calculation to do. It is slightly trickier than before. If the state of N3 is A, then the cost to change to C in the terminal node is 5 and there is a further cost of 2 which will be exacted while going from node N4=A to the terminal G, which we read from the array of values at N4. The first entry at N3 is then 7. A similar argument holds for G at N3 by symmetry.

On the other hand, if the state at N3 is C, then there is no cost for the terminal node, but we read the corresponding value 10 from the array at N4. This is the cost to stay C at N4. However, we can do better than that. Suppose we change from C to A at N4 (cost 5). Then there is only an additional cost of 2 for the change from A to G, for a total of 7.

The last possibilty at N3 is to have T. In that worst-case scenario we have a cost of 2 for C -> T, 5 for C -> R at N4, and then a further cost of 2 to reach the other purine. The total is 9.

For nodes which do not have any terminal nodes as direct descendants, we simply add the values. So the array for N1 is the sum of the arrays for its descendants (12 12 14 16). This makes it easy to compute the arrays for most interior nodes in a large tree.

If you read the first part of this post from before, you will recognize this as a dynamic programming algorithm. Felsenstein explains that the two algorithms are closely connected, although he resists the temptation to say that Fitch is a "special case" of Sankoff. But he explains that why Fitch works is easy to understand from a consideration of Sankoff.

Read the book to find out more. This text is the standard reference for the field. Felsenstein is almost always very clear, although I have to say that he is occasionally too encyclopedic for my interests. Still, it is highly recommended).

[UPDATE: Seven years later, I have received a comment from a reader who correctly points out an error in this post. Let's see if I can explain what the problem is and how to fix it.

The issue first shows up in evaluating node N3. As before, we write the costs for the states in the order A C G T. The cost matrix is

0 for no change

2 for transitions

5 for transversions (U <-> Y)

The naive approach is as follows. Suppose N3 = A, then the cost to change to C in the terminal node is 5 and *if N4 also has A*, the cost on that branch is the value in the array at N4(A) = 2, giving a total cost of 5 + 2 = 7, which is the first value shown for N3.

But this will only be correct if it is the *minimum* cost for N3 = A.

We must also evaluate the costs for having N3 = A and N4 be something other than A.

A -> A = 0 + 2 = 2

A -> C = 5 + 10 = 15

A -> G = 2 + 2 = 4

A -> T = 5 + 10 = 15

(Here the first value is for the change from N3 -> N4 and the second comes from the N4 array).

Since A -> A is the minimum score we add 2 to the 5 for A -> C at the terminal node to obtain the final value for A at N3 of 7.

We continue in this manner for each array position:

C -> A = 5 + 2 = 7

C -> C = 0 + 10 = 10

C -> G = 5 + 2 = 7

C -> T = 2 + 10 = 12

Pick the minimum as the value for C at N3. Here we see that the cost is lower if we switch to a purine at N4. There is no cost for C -> C at the terminal node.

G -> A = 2 + 2 = 4

G -> C = 5 + 10 = 15

G -> G = 0 + 2 = 2

G -> T = 5 + 10 = 15

The minimum has G at N3. To the 2 we add 5 for A -> C at the terminal node to obtain the final value for A at N3.

T -> A = 5 + 2 = 7

T -> C = 2 + 10 = 12

T -> G = 5 + 2 = 7

T -> T = 0 + 10 = 10

The value for T at N3 should be 12 (counting the cost for T -> C in the terminal node). The figure is in error, as is the text, because I was writing C -> T when I should have written T -> C.

Now we come to the root, N1. Again, we need to find the minimum cost. We evaluate N2 and N3 separately. For N1 -> N2 we have:

A -> A = 0 + 5 = 5

A -> C = 5 + 5 = 10

A -> G = 2 + 7 = 9

A -> T = 5 + 7 = 12

For N1 -> N3 we have

A -> A = 0 + 7 = 7

A -> C = 5 + 7 = 12

A -> G = 2 + 7 = 9

A -> T = 5 + 12 = 17

So the minimum cost for A at N1 is 12, which occurs when N2 and N3 are both A as well.

C -> A = 5 + 5 = 10

C -> C = 0 + 5 = 5

C -> G = 5 + 7 = 12

C -> T = 2 + 7 = 9

C -> A = 5 + 7 = 12

C -> C = 0 + 7 = 7

C -> G = 5 + 7 = 12

C -> T = 2 + 12 = 14

The minimum cost for C at N1 is 12.

G -> A = 2 + 5 = 7

G -> C = 5 + 5 = 10

G -> G = 0 + 7 = 7

G -> T = 5 + 7 = 12

G -> A = 2 + 7 = 9

G -> C = 5 + 7 = 12

G -> G = 0 + 7 = 7

G -> T = 5 + 12 = 17

The minimum cost for G at N1 is 14.

T -> A = 5 + 5 = 10

T -> C = 2 + 5 = 7

T -> G = 5 + 7 = 12

T -> T = 0 + 7 = 7

T -> A = 5 + 7 = 12

T -> C = 2 + 7 = 9

T -> G = 5 + 7 = 12

T -> T = 0 + 12 = 12

The minimum cost for T at N1 is 16.

If you're keeping score at home, you will notice that this is the same result we obtained by simply adding the values from the arrays for N3 and N4.

The counter-example from above had us switching to a purine at N4 to overcome the substantial costs of Y -> A,G = 10 at the terminal nodes.

What it looks like to me after consulting other sources is that, most of the time, simply adding the values from the two arrays works. However, carrying out the full minimization procedure is the correct approach.

Saturday, March 20, 2010

Random walk in 1D

This is a post about 1D random walks. For example, the first graphic is a plot of step number versus distance away from the origin for six independent chains. The code is in the first listing below. I realized when doing this plot that I need to look into generating "rainbow" colors in matplotlib, but I want to go ahead and put this up without that for now. I did a similar example in R in a previous post.

Now, the problem (as posed by Grinstead and Snell) is to investigate what happens at longer times. In the second code example, we run a random walk for a long time (1E7 steps) and record the step number each time the walk returns to 0. The plot shows the number of steps between returns to 0, as a histogram. Both axes are scaled as log₂ of the value.

The walk can wander quite far away from 0 (and for a long time). The list of the largest number of steps between successive visits to 0 is:

The log-log plot looks like some kind of exponential decay. And we always come back. The value for the last interval between visits to 0 is 4.

The code uses the Counter class from here.

import random, sys
from math import log
import numpy as np
import matplotlib.pyplot as plt
import Counter

random.seed(157)
X = range(1000)
colors = list('bgrcmk')
for i in range(len(colors)):
    pos = 0
    L = [pos]
    for x in X[1:]:
        pos += random.choice((-1,1))
        L.append(pos)
    plt.scatter(X,L,s=10,color=colors[i])
    
ax = plt.axes()
ax.set_xlim(-10,X[-1]+10)
plt.savefig('example1.png')

import random, sys
from math import log
import numpy as np
import matplotlib.pyplot as plt
import Counter

random.seed(7)
# 1D walk
pos = 0
L = list()
N = int(1e7)
for i in range(N):
    pos += random.choice((-1,1))
    if pos == 0:  L.append(i)
    
L = [L[i] - L[i-1] for i in range(1,len(L))]
print L[-1]
c = Counter.Counter(L)
for k in sorted(c.keys()):
    print str(k).rjust(8), c[k]

L = [np.floor(log(k)/log(2)) for k in L]
c = Counter.Counter(L)
print c.most_common()

ax = plt.axes()
ax.set_xlim(0,22)
ax.set_ylim(0,8)
for k in c:
    n = log(c[k]/log(2))
    r = plt.Rectangle((k,0),
                      width=1,height=n,
                      facecolor='magenta')
    ax.add_patch(r)

D = {'fontsize':'large'}
ax.set_ylabel('log n',D)
ax.set_xlabel('log steps between 0',D)
plt.savefig('example2.png')

Tuesday, March 16, 2010

Red card

Another set of videos from the Machine Learning Summer School 2009 features David MacKay on Information Theory (link). He poses a problem in probability which I believe is famous, but I can't find the reference now...

[UPDATE: The Box Paradox (Grinstead & Snell, p. 181)]

In his formulation, you are shown three cards. One is red on both sides, while the second is white on both sides, and a third one is red on one side and white on the other. I pick a card at random (out of sight) and show you the "up" side, which is red. What is the probability that the other side is also red?

In the class, about 50 said 1/2 and about 70 picked 2/3.

One way to solve the problem is to consider the "sample space" carefully.

Another is to write a simulation of the cards in Python. This seems to fit quite naturally with a "Card" class. We simulate a small number of draws (10⁵), and get the answer trivially (result below).

However, MacKay suggests that we solve the problem by the application of Bayes Theorem. That took me a bit to puzzle through, but after a while I came up with the following. First of all, remember Bayes' rule. We compute:

P(H|D) = P(D|H) P(H) divided by Σ P(D|H*) P(H*) for all hypotheses

What is our hypothesis? Let's say it is the event that we drew the card with two red sides, so that the side that is yet unseen is also red. What is the prior for this hypothesis? It seems uncontroversial to use the fraction of the cards which fit this hypothesis:

P(H) = 1/3

The data (or datum) is clear: one card with a red side up. What is the probability of the data we observed, given this hypothesis? Since the card is red on both sides, we have:

P(D|H) = 1

What is the marginal probability of the data, the probability under all hypotheses? This normalizing constant is:

Σ P(D|H*) P(H*) =
1/3 * 0 + 1/3 * 1/2 + 1/3 * 1 = 1/2

[UPDATE: Realized that I used the symbol * in the last line for multiplication, in the usual Python way. But H* in the line above uses * as a wildcard, to mean all H. ]

P(H|D) = P(D|H) P(H) / sum

= 1 * 1/3 divided by 1/2

= 2/3

The simulation prints:

$ python Card.py 
{'r': 33465, 'w': 16681}

Here's the code:

import random
class Card:
def __init__(self,*args):
self.colors = args
def __repr__(self):
return ''.join(self.colors)
def sideup(self):
self.up = random.choice(self.colors)
return self.up
def sidedown(self):
i = self.colors.index(self.up)
if i:  return self.colors[0]
return self.colors[1]

L = [Card('r','r'),Card('r','w'),Card('w','w')]
countDict = { 'r':0, 'w':0 }
for i in range(int(1E5)):
c = random.choice(L)
if c.sideup() == 'r':
reverse = c.sidedown()
countDict[reverse] += 1
print countDict

Monday, March 15, 2010

Sampling a distribution

I was listening to an introductory talk about MCMC (link). I'll have more to say about the talk another time (hint: I think it's terrific). The speaker described a method for sampling from a probability distribution. The idea is to generate random samples from U[0,1], and then find the value of x at which the cumulative density function first exceeds the random number. The resulting samples of x are plotted as a histogram.

The Python script has three parts. In the first section we set up a weighted mixture of three normal distributions of different mean and sd, and plot that in red using x increments of 0.01. In the second section the same values are used to generate a discrete cdf for the same points. This is plotted in blue (after normalizing by the interval size).

Then we implement the algorithm for sampling. Binning for the histogram used the collections.Counter class that will be in Python 2.7. I don't have that, so I grabbed the class from here.

You might notice that the labels on the y-axis also cover the histogram. It would be better to have two separate plots, but I'm not quite sure how to handle that yet.

[UPDATE: more here]

import math, sys
import numpy as np
import matplotlib.pyplot as plt
import Counter

def normal(mu,sigma):
    def f(x):
        z = 1.0*(x-mu)/sigma
        e = math.e**(-0.5*z**2)
        C = math.sqrt(2*math.pi)*sigma
        return 1.0*e/C
    return f

p1 = normal(0,2)
p2 = normal(10,1)
p3 = normal(18,0.5)

# sum of weighted normal distributions
@np.vectorize
def p(x):
    return 0.5*p1(x) + 0.25*p2(x) + 0.25*p3(x)

dx = 0.01
xmax = 25    
R = np.arange(-10,xmax+dx,dx)
# dashed lines
plt.plot((R[0],R[-1]),(0,0),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(1,1),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(-0.5,-0.5),color='k',ls=':',lw=2)
L = p(R)
plt.plot(R,L,color='r',lw=3)
#===========================================
cdf = [L[0]]
for e in L[1:]:
    cdf.append(cdf[-1] + e)
cdf = np.array(cdf)
cdf *= dx
plt.plot(R,cdf,color='b',lw=3)

ax = plt.axes()
ax.set_xlim(-6,xmax)
ax.set_ylim(-0.55,1.05)
#===========================================
def find_first(n,L):
    for i,e in enumerate(L):
        if n < e:  return i
    return len(L)

samples = list()
width = 0.4
f = 1/width
for i in range(1000):
    n = np.random.random()
    # must adjust to actual range
    value = find_first(n,cdf)*dx - 10.0
    # trick to truncate at fractional values
    samples.append(np.floor(f*value)/f)

#samples = np.array(samples)
c = Counter.Counter(samples)
maxn = c.most_common(1)[0][1]
for k in c:
    n = c[k]
    n = n * 0.45 / maxn
    r = plt.Rectangle((k,-0.5),
                      width=width,height=n,
                      facecolor='green')
    ax.add_patch(r)
plt.savefig('example.png')

Saturday, March 13, 2010

Peaks of density in numpy

Here is a Python script that generates peaks of user-defined density. It has two main parts. Given an existing matrix and a point x,y (in the interval 0-1), it constructs a matrix of the same shape containing the distance of each element from the designated point.

The second part uses the distance information to construct peaks of density to add to the matrix. The score function is vectorized. For each input element of a distance matrix, it computes a score based on (i) a threshold beyond which the score is zero, (ii) a height for the maximum score, and (iii) a function describing how the density changes with distance.

Nothing fancy, and a bit slow, but it does make it easy to play with the parameters and generate a density like that in the sample. At the end, we need to make sure each value is <= 1.0 and >= 0, because make_heatmap converts these directly to "gray" values to input to the Rectangle function.

Here is output for a 10 x 10 example:

0.04  0.09  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
0.09  0.81  0.09  0.04  0.04  0.05  0.16  0.22  0.16  0.05
0.04  0.09  0.04  0.04  0.04  0.16  0.41  0.56  0.41  0.16
0.04  0.04  0.04  0.04  0.04  0.22  0.56  1.0   0.56  0.22
0.04  0.04  0.04  0.04  0.04  0.16  0.41  0.56  0.41  0.16
0.04  0.04  0.04  0.04  0.04  0.05  0.16  0.22  0.16  0.05
0.04  0.07  0.09  0.07  0.04  0.04  0.04  0.04  0.04  0.04
0.07  0.17  0.28  0.17  0.07  0.04  0.04  0.04  0.04  0.04
0.09  0.28  0.81  0.28  0.09  0.04  0.04  0.04  0.04  0.04
0.07  0.17  0.28  0.17  0.07  0.04  0.04  0.04  0.04  0.04

import matplotlib.pyplot as plt
import numpy as np
import HM_Utils_frac

def init_data(N=30):
    A = np.zeros(N**2)
    A.shape = (N,N)
    return A

def eucl_distance(x0,y0,x,y):
    return np.sqrt((x-x0)**2+(y-y0)**2)

def distance_matrix(A,x,y):
    # matrix is 1.0 x 1.0 in "size"
    # given a matrix A with shape R,C
    # each row,col in M is a fractional pos
    R,C = A.shape
    L = [1.0*c/C for c in range(C)]
    M = list()
    for r in range(R):
        for c in range(C):
            M.append((L[c],L[r]))
    # now compute each dist from given point x,y
    M = [eucl_distance(x0,y0,x,y) for x0,y0 in M]
    M = np.array(M)
    M.shape = A.shape
    return M

@np.vectorize
def score(d,ht,thr,func):
    # given a distance from x,y compute score
    # first, check threshold
    if d > thr:  return 0.0
    d = thr - d
    # scale d to func(d) and height
    return ht* (func(d) / func(thr))

def make_peak(A,x,y,ht,thr,func=None):
    M = distance_matrix(A,x,y)
    if not func:  func = lambda x: x**1.5
    return score(M,ht,thr,func)
    
def scale(A):
    maxN = np.max(A)
    if maxN > 1.0:  A = A/(1.0*maxN)
    return A
    
def f(x):  return x**4

N = 40
ax,fig = HM_Utils_frac.init_fig()
A = init_data(N)
A += make_peak(A,x=0.7,y=0.3,ht=1.0,thr=0.3)
A += make_peak(A,x=0.2,y=0.8,ht=0.8,thr=0.4,func=f)
A += make_peak(A,x=0.1,y=0.1,ht=0.8,thr=0.2,func=f)
A = A + 0.04
A = scale(A)
if N <= 10:  HM_Utils_frac.show_array(A)     
HM_Utils_frac.make_heatmap(A,ax)
plt.savefig('example.png')

Yet another heatmapper

I'm planning to try a small Python MCMC simulation, utilizing a simple 2D grid of densities. I wrote a new heatmap script as a first step. I'll post the code today and then use it in later updates.

One small thing, the original version asked matplotlib for the xlim and ylim and then calculated fractional square dimensions to use. Since it seemed a bit slow, I changed the script so that we tell the Axes object what dimensions we want (the array's shape, naturally). Then there are no fractions and the increment is +1. However, it was not significantly faster at all. Here is the timeit output:

$ python -m timeit -s 'from HM_Utils import demo' 'demo(10)'
10 loops, best of 3: 225 msec per loop
$ python -m timeit -s 'from HM_Utils_frac import demo' 'demo(10)'
10 loops, best of 3: 229 msec per loop

Here's the listing:

import matplotlib.pyplot as plt
import numpy as np

def init_fig():
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    return ax,fig
    
def show_array(A):
    for row in A:
        L = [str(round(n,2)).ljust(4) for n in row]
        print '  '.join(L)
    
# origin is lower left
def make_heatmap(A,ax):
    R,C = A.shape
    ax.set_xlim(C)
    ax.set_ylim(R)
    for r in range(R):
        for c in range(C):
            n = A[r][c]
            assert 0.0 <= n <= 1.0
            p = plt.Rectangle((r,c),
                    width=1,height=1,
                    # black is 0.0, w is 1.0
                    facecolor=str(1-n),
                    edgecolor='w')
            ax.add_patch(p)
            
def demo(N=10):
    ax,fig = init_fig()
    A = np.random.uniform(size=N**2)
    A.shape = (N,N)
    if N <= 10 and __name__ == '__main__':  
        show_array(A)
    make_heatmap(A,ax)
    plt.savefig('example.png')
    
if __name__ == '__main__':
    demo(50)

Friday, March 12, 2010

Selecting array elements in R and numpy

I was a bit confused about the behavior of vectorize in numpy (last post), so I posted a question to SO. Of course, they knew the answer immediately, as I described.

But there is one more thing that came up, and that has to do with the selection of elements from an array. I wasn't aware that you can do indexing by a test that returns a boolean vector in numpy, but you can. In R, it's common to see something like this. Here is a matrix m. We can leave a row out by "-":

> m = 1:9
> dim(m) = c(3,3)
> m = t(m)
> m
    [,1] [,2] [,3]
[1,]    1    2    3
[1,]    4    5    6
[2,]    7    8    9
> m[-1,]
    [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9

The test m[,1] > 2 asks for all rows in which the first column is greater than 2:

> m[,1]
[1] 1 4 7
> sel = m[,1] > 2
> sel
[1] FALSE  TRUE  TRUE

Note: I screwed this up in the original example. We use the selector to get those rows:

> m[sel,]
    [,1] [,2] [,3]
[1,]    4    5    6
[2,]    7    8    9

How would you do this in Python with numpy?

import numpy as np
A = np.arange(1,10)
A.shape = (3,3)
print A

[[1 2 3]
[4 5 6]
[7 8 9]]

B = A[A[:,0]>2]
print B

[[4 5 6]
[7 8 9]]

sel = A[:,0]>2
print sel

[False  True  True]

print A[sel,:]

[[4 5 6]
[7 8 9]]

It works!

Vectorize in numpy

I'm working on a little project using numpy and matplotlib where I found a need to write my own function that operates on numpy arrays. So, for example, we could have, as a first attempt:

import numpy as np
A = np.arange(0,1,0.1)
def f(n):
    if n > 0.3:  return n
    return 0
print A
print f(A)

$ python vec.py 
[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9]
Traceback (most recent call last):
  File "vec.py", line 7, in 
    print f(A)
  File "vec.py", line 4, in f
    if n > 0.3:  return n
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Numpy is trying to tell us that we can't just feed an array to a function that operates on single values. The function has to be "vectorized." Say:

f = np.vectorize(f)
print f(A)

[0 0 0 0 0 0 0 0 0 0]

We didn't get a ValueError, but the result is not correct. What happened? According to the help(np.vectorize):

Data-type of output of vectorized is determined by calling the function with the first element of the input.

The problem is that although the array contains floats, we specified 0 as the value to return for n < 0.3. The first n evaluated was 0.0 which is less than 0.3 and so returns the integer 0. That leads to trouble, since the vectorized function now returns all the rest of the values as ints. Or so the docs imply. (Also see here). There are a couple more issues. First:

A[0] = 2.0
print f(A)
B = np.arange(1,0,-0.1)
print f(B)

[2 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0]

The output above is obtained if it is run in the same script as the first part. That is, the data type to be returned is only evaluated once, when the function is first called. If the first call is commented out, then we get this:


[ 2.   0.   0.   0.3  0.4  0.5  0.6  0.7  0.8  0.9]
[ 1.   0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.   0. ]

One way to fix the whole problem is to return a float. The following example also uses vectorize as a decorator

@np.vectorize
def f2(n):
    if n > 0.3:  return n
    return 0.0
print f2(A)

[ 0.   0.   0.   0.3  0.4  0.5  0.6  0.7  0.8  0.9]

Continuing in the docs:

This can be avoided by specifying the otypes argument as either a string of typecode characters or a list of data-types specifiers.

We're supposed to do something like:

def f3(n,otypes=[np.float]):
def f3(n,otypes='d'):

But, in my tests, neither of these can compensate for returning the integer 0. The real reason for the otypes argument seems to be that there is a large number of numpy dtypes.

Wednesday, March 10, 2010

Strang's Linear Algebra: least squares

Strang has an excellent section on least squares. I'm going to work through an example here as a way of building my understanding (homework). We take the simplest possible case of three time points equally spaced at t = 1, 2, 3. At these times we observe data y = 2, 2, 4 (to make it slightly different than his example). The three points are not co-linear, as you might guess and can see from the plot. The numpy function polyfit will fit a line to this data (that's what I plotted above), but we're going to find it ourselves.

The line that "best fits" the data has the equation y = C + Dt. We need to find C and D. We use calculus first, then linear algebra. The magnitude of the "error" at times 1, 2 and 3 is the difference between the value on the line and the value we actually observed.

e1 = C + 1D - 2
e2 = C + 2D - 2
e3 = C + 3D - 4

We calculate the sum of the squares of the errors, which we wish to minimize. One reason for the square is that the minimization involves the derivative, and the derivative of the square function is a linear function.

SumSq = (C+D-2)² + (C+2D-2)² + (C+3D-4)²

This is a function of two variables (C and D), so we need to take both partial derivatives and set them both equal to zero, (yielding two equations in two unknowns).

∂SumSq/∂C = 2(C+D-2) +   2(C+2D-2) +   2(C+3D-4) = 0
∂SumSq/∂D = 2(C+D-2) + 2*2(C+2D-2) + 3*2(C+3D-4) = 0

The factor of 2 in each term goes away. Notice that the first equation says that the sum of the errors is equal to zero.

Solve for C in the first eqn:

C + D - 2 + C + 2D - 2 + C + 3D - 4
3C + 6D = 8
3C = 8 - 6D

Substitute C into the second equation:

C + D - 2 + 2C + 4D - 4 + 3C + 9D - 12 = 0
6C + 14D = 18
14D = 18 - 6C 
14D = 18 - 16 + 12D
2D = 2
D = 1
C = 2/3

In Bolstad (p. 237) the equation for the slope is given as:

B = mean(xy) - mean(x)*mean(y) / mean(x²) - mean(x)²

mean(xy) = mean(1*2 + 2*2 + 3*4) = 18/3 = 6
mean(x)*mean(y) = 2 * mean(2,2,4) = 16/3
mean(x²) = mean(1,4,9) = 14/3
mean(x)² = 4

B = 18/3 - 16/3 = 1
    -----------
    14/3 - 12/3

and the intercept is

A_o = mean(y) - B*mean(x) = 8/3 - 2 = 2/3

Looks good. Let's do some linear algebra! We go back to the equation for the line: C + Dt. We have one equation for each observation, which is written like so:

[ 1 1 ] [ C ] = [ 2 ]
[ 1 2 ] [ D ]   [ 2 ]
[ 1 3 ]         [ 4 ]

Ax = b has no solution. We seek the projection of b onto A.

Using the formula from last time, (and Python to do the calculation, below) we obtain:

P = [  5/6  1/3  -1/6 ]
    [  1/3  1/3   1/3 ]
    [ -1/6  1/3   5/6 ]

p = P b = [  5/3 ]
          [  8/3 ]
          [ 11/3 ]

We can check that the error e = b - p


[ 2 ] - [  5/3 ] = [  1/3 ]
[ 2 ] - [  8/3 ] = [ -2/3 ]
[ 4 ] - [ 11/3 ] = [  1/3 ]

is perpendicular to p:

e•p = 5/9 - 16/9 + 11/9 = 0

import numpy as np
import matplotlib.pyplot as plt

t = np.array([1,2,3])
b = np.array([2,2,4])
plt.scatter(t,b,s=300,color='r')

ar,by = np.polyfit(t,b,1)
def f(x):  return by + ar*x
T = np.arange(0,3.2,0.1)

plt.plot(T,f(T),lw=4,color='k')
plt.text(0.2,3.5,fontsize=24,
    s='y-intercept =' + str(round(by,3)))
plt.text(0.2,3.0,fontsize=24,
    s='slope =' + str(round(ar,3)))

ax = plt.axes()
ax.set_xlim(0,3.2)
ax.set_ylim(0,4.2)
plt.savefig('example.png')

col1 = np.array([1,1,1])
t.shape = b.shape = col1.shape = (3,1)
A = np.hstack([col1,t])
print A

def project(A):
    AT = A.transpose()
    temp = np.linalg.inv(np.dot(AT,A))
    return np.dot(A,np.dot(temp,AT))

P = project(A)
print P
print b
print np.dot(P,P)
print np.dot(P,b)

Tuesday, March 9, 2010

Running Phylip as a process

This post is in the category of a "note to myself." I need to work harder on phylogenetics, and in particular, on maximum likelihood methods. I went to Joe Felsenstein's site to get Phylip.

The current version is 3.69. I downloaded the one compiled for OS X. It contains a directory full of executables with names like dnaml.app. If you double-click one of those, it will bring up a Terminal window and you can run the program interactively, specifying the necessary infile path and desired settings, etc. There is also a symbolic link to the actual executable which is inside the bundle.

I wanted more flexibility than this, for example, to run a program from any directory and have the output written there. As a test, I chose the dnadist program, which calculates distances for an alignment under different models. I put a link into the directory ~/bin:

ln -s ~/Software/phylip-3.69/exe/dnadist.app/Contents/MacOS/dnadist ~/bin/dnadist

Since I have ~/bin on my $PATH, if I'm on the Desktop I can just do:

$ dnadist

But we can do more than this. I would like to run the program as a separate process from Python. In order to do that I need a way to anticipate and respond to the interactive prompts. Phylip provides access, though it is a little clunky. We need to issue a command to the shell something like:

dnadist < responses > screenout &

We invoke the program with our input placed in a file. It doesn't seem to take an infile name on the command line in this mode. So we put that path (and it has to be a full path) on the first line of the file responses. The sequence file format must be .phy format. Luckily clustal will write this.

We need to anticipate that if the file 'output' exists, we'll be asked whether to remove it ('R'). And the various parameters get toggled in a funny way:

Settings for this run:
  D  Distance (F84, Kimura, Jukes-Cantor, LogDet)?  F84
  G          Gamma distributed rates across sites?  No
  T                 Transition/transversion ratio?  2.0
  C            One category of substitution rates?  Yes
  W                         Use weights for sites?  No
  F                Use empirical base frequencies?  Yes
  L                       Form of distance matrix?  Square
  M                    Analyze multiple data sets?  No
  I                   Input sequences interleaved?  Yes
  0            Terminal type (IBM PC, ANSI, none)?  ANSI
  1             Print out the data at start of run  No
  2           Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change

Interactively, we would input 'D' twice in a row to progress to the third option. Finally, we need to do 'Y' to accept all the options. Here is a Python script to do all. Invoke it with 'python script.py' as usual.

import os,subprocess

fn = 'capno1.phy'
prog = 'dnadist'

responses = '.temp.txt'
FH = open(responses,'w')
current_dir = os.getcwd()
FH.write (current_dir + '/' + fn + '\n')

outfile = os.path.exists(current_dir + '/outfile')
if outfile:  FH.write('R\n')

FH.write('D\n')
FH.write('D\n')
FH.write('Y\n')
FH.close()

cmd = prog
cmd += ' < ' + responses
cmd += ' >  screenout &'

p = subprocess.Popen(cmd, shell=True)
pid,ecode = os.waitpid(p.pid, 0)
print ecode