Friday, August 27, 2010

Mortgage simulation

This is a post describing a simple simulation that uses Python. I'm going to put it here even though it doesn't involve bioinformatics, since I only have the one blog.

Recently, I got involved in a real estate transaction. The problem is to calculate the monthly payment needed given the amount of principal borrowed (P), the monthly interest rate (r), and the length of the loan or number of payments (n). Since the payments are monthly but the interest rate is typically provided as a yearly percentage, that percentage must be divided by 1200 to give the monthly rate. Each turn of the cycle, interest on the current balance is paid first, and then the remainder of the payment is applied to principal.

Rather than derive the formula myself, it was easy enough to look it up on Wikipedia.


r = 4.25/1200.0    # monthly interest rate
n = 360            # number of monthly payments (30 years)
P = 250000         # principal
d = 1 - (1/(1+r))**n
A = P * r / d      # monthly payment
print '%3.2f' % A

This Python script calculates that for a loan amount of 250 thousand dollars, an annual rate of 4.25%, and a term of 30 years, the payment needed is $1229.85. About 72% of the first payment is interest. I'm happy to report that the loan broker got the right answer.

A more laborious way to do the same calculation is to simulate the monthly payment cycle. Something like this code fragment:

current_balance = 250000              # principal
monthly_interest_rate = 4.25/1200.0
monthly_payment = 1200.0              # starting guess, to be adjusted
num_payments = 0
while current_balance - monthly_payment > 0:
    num_payments += 1
    interest = current_balance * monthly_interest_rate
    current_balance -= (monthly_payment - interest)
payoff = current_balance

and then adjust the monthly_payment variable until num_payments is equal to the desired value.
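
For concreteness, here is one way to automate that adjustment, using bisection on the payment (just a sketch; count_payments is a made-up helper that wraps the loop above):

def count_payments(monthly_payment, balance=250000, rate=4.25/1200.0):
    # run the simulation loop above and return the number of payments
    n = 0
    while balance - monthly_payment > 0:
        n += 1
        balance -= monthly_payment - balance*rate
    return n

lo, hi = 900.0, 5000.0              # lo must exceed the first month's interest (~885)
for i in range(50):                 # bisect until the payment converges
    mid = (lo + hi)/2
    if count_payments(mid) > 360:   # too many payments: the payment is too small
        lo = mid
    else:
        hi = mid
print '%.2f' % hi                   # close to the value from the formula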

One advantage of the simulation is that we can investigate the effect of paying an extra sum toward principal each month. Home mortgages typically allow "prepayment" resulting in a quicker payoff and lower total interest payments over the lifetime of the loan, while other types of loan such as a car loan may not. In the example given below, an extra payment of $100 each month shortens the payback period by 50 months.

A second advantage is that the simulation allowed me to model in a simple way the influence of inflation. As you know, inflation refers to a general rise in the price of goods and services. While it is quite painful to talk about what the current rate is (very low) and its meaning for the "recovery", a glance at the monthly and yearly averages over the past decade (range 1.59-3.85 percent) prompted me to estimate the average rate as 3% per year in the U.S., where I live.

I got interested in the effect of inflation because, I thought, the extra dollars paid early in the life of the loan should be worth more than the savings obtained by ending the payments early, and so the advantage of pre-payment would be reduced. And this is true. But what I also discovered is that there is a huge effect of inflation on the total cost of the loan and that makes the loan a remarkable bargain.

The interest and loan balance are calculated in nominal (unadjusted) dollars, but we keep track of the "total cost" separately; there, each payment is discounted back to today's dollars at the assumed inflation rate.

Here is the output (code listing at the end):


with inflation
360 payments of    1229.84
final payoff:         7.06
total cost:      291712.05

no inflation
360 payments of    1229.84
final payoff:         7.06
total cost:      442749.46

with inflation
310 payments of    1329.84
final payoff:        22.28
total cost:      265102.21

no inflation
310 payments of    1329.84
final payoff:        22.28
total cost:      381272.68


We calculate the number of payments and total cost for the standard monthly payment, or with an additional $100, either with or without inflation. If you do the math, you'll find that the extra payment regime saves about 61K without inflation and only 26K with inflation. But the big effect is on the total cost of the loan. With the standard payment inflation saves 150K. Wow.

I knew that inflation is bad for investors (one example), but never realized just how good it is for borrowers. Two sides of the same coin, I suppose.

[UPDATE: I forgot to mention that I am suspicious about the low total cost with inflation. The method without inflation is easily checked and is correct within roundoff error. ]


initial_balance = 250000
interest = 4.25   # percent
monthly_rate = interest/1200.0
term = 360        # months
inflation = 3.0   # percent
monthly_inflation_rate = inflation/1200.0

def show(month,base,extra,current,total):
    print (str(month) + ' payments of').ljust(15),
    print ('%3.2f' % (base + extra)).rjust(10)
    print ('final payoff:').ljust(15),
    print ('%3.2f' % current).rjust(10)
    print ('total cost:').ljust(15),
    print('%3.2f' % total).rjust(10)
    print

def test(base,extra=0,with_inflation=True):
    month = 0
    current = initial_balance
    total = 0
    inflation_multiplier = 1.0
    
    while current - base - extra > 0:
        month += 1
        inflation_multiplier *= (1.0 / (1.0 + monthly_inflation_rate))
        interest = monthly_rate * current
        current -= base - interest + extra
        cost = base   # only the base payment is tracked in the cost total
        if with_inflation:  
            cost *= inflation_multiplier
        total += cost
        
    total += current
    show(month,base,extra,current,total)
        
base = 1229.84
for extra in [0,100]:
    print 'with inflation'
    test(base,extra)
    print 'no inflation'
    test(base,extra,with_inflation=False)

Saturday, June 26, 2010

Summer reading

One important aspect of my thinking and reading that doesn't get much airtime here is policy and politics, mainly because that's not why you're here (if indeed anyone is here). But I have something to say about the current controversy regarding a blogger at the Washington Post named Dave Weigel.

I haven't actually read Weigel, and I think some of the statements that he made in emails and later got fired for making, even though they were supposed to be private, were in very poor taste if not indefensible. But I am quite simply amazed to see him grouped with Ezra Klein, who I do read every day, as "partisan."

To me, a partisan, at least in the current political climate, is a writer who is not just coming from a particular perspective, but who distorts or misrepresents the facts to argue for policies and politics he likes. (Someone like George Will or David Broder or Charles Krauthammer, say, and more problematic, many of the political writers for both the Times and the Post. These guys invent a story they like, and then try to support it).

Ezra speaks the truth as far as he knows it, and if you dispute his facts, I expect him to figure out the truth of it and respond. That is one of the reasons why I like him so much. He's a factual guy. He also has great sources.

The central policy problem of our age is that Republican propaganda (think Fox News or the Washington Post editorial page) is seen as normal political discourse. Lying about the facts does not help move the country forward. That's my belief.

If you're interested, here is my current daily blog reading list:

Matt Yglesias
Ezra Klein
Taegan Goddard
Steve Benen
Kevin Drum
TPM
James Fallows
Tom Ricks
Brad DeLong
Stephen Walt
Bernard Avishai

Wednesday, June 23, 2010

Triangulation in the pentagon

As everyone knows, a pentagon is a 5-sided polygon, while a regular pentagon has all five sides the same length. As shown in the first figure (left), one can generate a pentagon starting with an isosceles triangle whose apical angle α is equal to 360/5 = 72°. Successive rotations of the triangle about the apex by this angle generate a pentagon with the apex as the center.



The other two angles of the triangle are equal to each other (β) and measure (180 - α)/2 = 54°. The total angle at each vertex (θ) is twice that or 108°.

Another calculation (which is really the same one) considers the orientation of a person we imagine walking around the perimeter of the figure. After one complete tour, the walker will have rotated a total of 360°. At each vertex, the amount turned is the same, 360/5 = 72°. The angle at the vertex is the supplement of that, 108°. Both approaches to calculating the angles will work for any regular polygon.

Next, consider the chords of the pentagon, as shown in the same figure, right panel---lines that connect vertices which are not adjacent. Using the rotational and reflective symmetry of the regular pentagon, it is easy to show by similar triangles that chord QT in the second figure is parallel to edge RS. A similar relationship holds for each of the other chords and a corresponding edge.

Label the smaller angle between a chord and adjacent edge as γ. We can show that the two chords originating from each vertex divide the vertex angle θ into three equal parts. Hence the label γ on angle SQT in the figure.

Perhaps the simplest method to show this is algebraic. The triangle PQT is isosceles. The apical angle QPT is a vertex angle of the pentagon so it equals 108°. Thus, the two remaining angles (γ) each equal (180 - 108)/2 = 36°. But TQS equals θ - twice γ, or 108° - 2*36°, which is also equal to 36°. We have:

α = 72°
β = 54°
θ = 2 * β = 108°
γ = 1/3 θ = 36°

Because their base angles are equal, PQT and PQU are similar triangles (second figure, left). So the angle PUQ (labeled δ) is the same as the vertex angle of the pentagon, and is also equal to RUT. This means that the inner five-sided figure is a regular pentagon. (The five-fold rotational symmetry also makes this clear).



There is a special relationship between the parts of each chord, which are labeled a and b in the right panel.

We label the sides of the inner pentagon as b, and the other parts of each chord (which are equal to each other), as a. Recall that QT is parallel to RS. By symmetry then, PT is parallel to QS, and PT is also parallel to VS. This means that the quadrilateral PVST is a parallelogram. Opposing sides of a parallelogram are equal, so we deduce that the length of each edge is equal to a plus b. The parallelogram is actually a rhombus, with all four sides equal.

Now, ignore for a moment the labels a and b. (But remember the result about the rhombus). The two triangles QTW and RSW (next figure, left) are similar because each is an isosceles triangle with base angle γ and apical angle θ, the vertex angle of the pentagon. We can mentally set the scale of the pentagon so an edge has length 1. Relabel the segment WS with length a as 1/φ and the length of an entire chord as d.



By similar triangles, the ratio of d to 1 (triangle QTW) is equal to the ratio of 1 to 1/φ (triangle RSW).

d/1 = 1/(1/φ) = φ

But QS = QT, so

1 + 1/φ = d
1 + 1/φ = φ

Multiply by φ and rearrange terms:

φ² - φ - 1 = 0

Using the quadratic formula and taking the positive root, we solve for φ:

φ = (1 + √5)/2

This is the golden ratio (explaining our choice of symbols!). That is:

a + b     a
-----  =  -  =  φ
  a       b
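
As a quick numerical sanity check, put a regular pentagon on the unit circle and compare the chord/edge ratio with (1 + √5)/2 (a small sketch in Python):

from math import cos, sin, pi, sqrt

# vertices of a regular pentagon on the unit circle
pts = [(cos(2*pi*k/5), sin(2*pi*k/5)) for k in range(5)]

def dist(p,q):
    return sqrt((p[0]-q[0])**2 + (p[1]-q[1])**2)

edge = dist(pts[0], pts[1])    # adjacent vertices
chord = dist(pts[0], pts[2])   # skip one vertex
print chord/edge, (1 + sqrt(5))/2    # both print 1.618...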


To construct a regular pentagon, we need to find the center, labeled O in the next figure, below. O is also the center of the circumscribing circle. Knowing that all the vertices lie on that circle, together with the edge length, makes it possible to construct the pentagon.

Finding the center by construction starting from the pentagon is pretty easy. Bisect each of two sides, and draw the corresponding altitudes to the opposite vertex. These altitudes cross at the center.

I found two construction methods to go from a circumscribing circle to the pentagon.

Start with the red circle in the figure, centered at O, containing all five vertices of the pentagon we want to construct. Draw the diameter from vertex P and then the perpendicular diameter to that. Bisect this radius to find point X.

At this point, in the first method, draw the dotted purple circle with radius PX and center X to find point Y. Now draw the dotted magenta circle with radius PY and center P to find point T at the intersection with the red circle. That gives one side of the pentagon, PT.



I found it here under Interactive exercises > Ruler & compass > level 4.

The second method is from wikipedia. Starting from the point X, draw the line PX. Bisect the angle formed with the horizontal diameter and extend the line of angle bisector to where it meets the vertical diameter. Draw the perpendicular to the diameter through this point. This is the chord connecting Q and T.


It is not clear to me yet why either one works. The first construction is connected to the golden ratio result, but I haven't figured it out. When I do, I'll let you know.

Tuesday, April 27, 2010

Duly quoted


I have always imagined that paradise will be a kind of library.

--- Jorge Luis Borges

Sighted on AbeBooks---a favorite site. info on source

Sunday, April 25, 2010

Harmony of the sphere

I've been looking for a good derivation of the volume of a sphere, suitable for students of geometry. I found one in Precalculus Mathematics in a Nutshell by Simmons (Amazon). This focused little book is a perfect summary of pre-calculus math.

The method used is a little sneaky, as it seems to be a Riemann sum in disguise. But it doesn't really feel like calculus, and simply relies on two formulas for series, one for the sum of the integers between 1 and n, and the second for the sum of the squares of the integers:


1 + 2 + … + n = n(n+1)/2
1² + 2² + … + n² = n(n+1)(2n+1)/6


Consider a hemisphere with radius r, as shown in the figure.



We'll divide it up into horizontal slices, figure the volume of each, and then add up the volumes for all of the slices, yielding the volume of the entire hemisphere. Although the volumes for the individual slices are not quite right (because x isn't the same at the top and bottom of the slice), if we make the slices thin enough, this won't matter.

Cut the hemisphere into n slices. To make the math simpler, we'll throw one away, considering only n-1 of them. If this bothers you, remember that eventually n will become very large, so that n is nearly equal to n-1.

The width of each slice is r/n. Let y = 0 at the top.

Enumerate the slices with the index k. For the kth slice, y = kr/n. And by the Pythagorean theorem:

x² = r² - (r-y)²
   = 2ry - y²
   = 2r(kr/n) - (kr/n)²


The volume of each slice is then:

width times area:
(r/n) πx² = π (r/n) [2r(kr/n) - (kr/n)²]
          = πr³ [2(1/n)²k - (1/n)³k²]


We need to add up these values for k = 1 to k = n - 1. Take the terms in the brackets one at a time. The first is:

2 (1/n)² (Σk)


Using the standard formula that the sum of the integers from 1 to n = n(n+1)/2, canceling the 2 and remembering that we're actually summing from 1 to n-1:

(1/n)² (n-1) n


Notice that we have just enough n's on the bottom to cancel. For the (n-1) term, dividing by n we get (1 - 1/n) ≈ 1. The entire first term reduces to:

 (1) (1 - 1/n) ≈ 1


The second term is:

-(1/n)³ (Σk²)


Using the standard formula that the sum of the squares of the integers from 1 to n = n(n+1)(2n+1)/6, and remembering that we're summing from 1 to n-1:

-(1/n)³ (n-1)(n)(2n-1)/6


Again, we have just enough n's on the bottom to cancel.

(1/n)³ (n-1)(n)(2n-1)/6
  = (1 - 1/n)(1)(2 - 1/n)/6 ≈ 1/3


Putting it all together, we have

πr³ (1 - 1/3) = 2/3 πr³


The complete sphere is twice that:

V = 4/3 πr³

That was painless!
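
Here is the same idea done numerically (a small sketch): slice a hemisphere of radius 1 into n disks, add up the slice volumes as above, and compare with 2/3 π as n grows:

from math import pi

def hemisphere(n, r=1.0):
    total = 0
    for k in range(1,n):              # k = 1 .. n-1, as in the text
        y = k*r/n
        xsq = 2*r*y - y*y             # x² = 2ry - y²
        total += (r/n) * pi * xsq     # slice volume: width times area
    return total

for n in [10, 100, 10000]:
    print n, hemisphere(n), 2*pi/3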

Tuesday, April 20, 2010

Area under the curve


I had a post the other day about the product rule in differential calculus, and how to extend it to find the derivative of a reciprocal and a quotient. The product rule also does heavy duty in the opposite direction, in the method called integration by parts. Strang has a wonderful diagram of that (shown above), which I've reproduced here using matplotlib.

The product rule is:

(uv)' = u v' + v u'

Integrating both sides and isolating u v' we get:

∫u v' = u v - ∫v u'

where for the definite integral we have to remember to evaluate u times v at both of the limits of integration, subtracting the value at the lower limit from the upper one.


∫u v' = u₂v₂ - u₁v₁ - ∫v u'


Can you see each of these four terms as areas in the diagram? It's very pretty.

from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

@np.vectorize
def f(x):
    return 100*sqrt(x) + 0.2*x**3
# the curve
dx = 0.01
X = np.arange(0,10,dx)
plt.plot(X,f(X),'r',lw=6,zorder=1)

# the points
u1 = 2
u2 = 10
plt.scatter(u1,f(u1),s=250,
color='k',zorder=2)
plt.scatter(u2,f(u2),s=250,
color='k',zorder=2)
# u dv
X2 = np.arange(u1,u2+dx,dx)
plt.fill_between(X2,f(X2),0,
color='0.85',zorder=0)
# v du
YMAX = f(10)
plt.fill_between(X2,YMAX,f(X2),
color='0.35',zorder=0)
X3 = np.arange(0,u1+dx,dx)
plt.fill_between(X3,YMAX,f(u1),
color='0.35',zorder=0)

plt.text(6,100,s='$\int$ $u$ $dv$',
fontsize=24)
plt.text(2,300,s='$\int$ $v$ $du$',
fontsize=24,color='w')
plt.text(0,525,
s='$y = 100 \sqrt{x}$ + $0.2 {x^3}$',
fontsize=24,color='r')
plt.text(2.5,115,s='$u_1$,$v_1$',
fontsize=18)
plt.text(10.5,500,s='$u_2$,$v_2$',
fontsize=18)

ax = plt.axes()
r1 = plt.Rectangle((0,0),width=u1,
height=f(u1),facecolor='cyan')
ax.add_patch(r1)
ax.set_xlim(-1,12.2)
ax.set_ylim(-10,600)
plt.savefig('example.png')

Thursday, April 15, 2010

Comparing apples and oranges?

I have a post about Newton's brilliant insight that gravity as observed on earth could be extended to the moon. You can read about this lots of places, I like James Gleick's Isaac Newton (Amazon).
Many years later Newton told at least four people that he had been inspired by an apple in his Woolsthorpe garden--perhaps an apple actually falling from a tree, perhaps not. He never wrote of an apple. He recalled only:
I began to think of gravity extending to the orb of the Moon . . .

---gravity as a force, then, with an extended field of influence; no cutoff or boundary---
& computed the force requisite to kep the Moon in her Orb with the force of gravity at the surface of the earth . . . & found them answer pretty nearly. All this was in the two plague years of 1665-1666. For in those days I was in the prime of my age for invention & minded mathematicks and Philosophy more than at any time since.

The apple was nothing in itself. It was half of a couple---the moon's impish twin. As an apple falls toward the earth, so does the moon; falling away from a straight line, falling around the earth.

---Gleick
Isaac Newton (p. 55)

The force acting on the moon or the apple is proportional to the mass for each, but the acceleration imparted is in turn proportional to that force divided by the mass, which cancels the mass term (for apple and moon).

This leaves an equation relating the acceleration to a constant (G) times the mass of the earth (M) divided by the square of the distance to earth. Thus, we expect the acceleration (and the distance moved) for the moon and the apple to be in inverse ratio to the squares of their distances from the earth.


a = GM/r²

The key relationship is shown in the diagram.

We draw the tangent to the orbital circle at time-zero at a right angle with the radius. Then the distance moved in one second, s, is the velocity times the time. The hypotenuse, H, is calculated from R and s by the Pythagorean theorem. And the distance "fallen" in the direction of the earth is the difference between H and the radius, R.


R = EM distance
H = sqrt(R² + (orbital velocity × 1 sec)²)
d = H - R


d is very small compared to R and H, but it still works. Using modern values for the distances:


mean distance from E to M (km)
R (km) = 3.84×10⁵ (range 3.63-4.06)
circumference of M orbit = 2π × 3.84×10⁵ km

mean radius of E (km) = 6.37×10³ (range 6.36-6.38)
ratio (EM dist/E radius) = 60.3
ratio² = 3636

period of orbit of M
T = 27.32 d
  = 2.36×10⁶ seconds
orbital velocity of M
C/T = 1.022 km/sec

H = sqrt((3.84×10⁵)² + (1.022)²)
  = 384000.00000136002
M falls in 1 sec = 1.36 mm

earth surface gravity = 9.8 m/s²
you fall in 1 sec = 1/2 g t²
  = 4.9 m

ratio = 4900 / 1.36 = 3603
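
The same arithmetic in a few lines of Python (a sketch using the values quoted above):

from math import sqrt, pi

R = 3.84e5                    # earth-moon distance, km
T = 27.32 * 24 * 3600         # orbital period, seconds
v = 2*pi*R/T                  # orbital velocity, km/sec (about 1.022)
H = sqrt(R**2 + v**2)         # hypotenuse after one second
d = H - R                     # distance "fallen" toward the earth, km
print d*1e6, 'mm'             # about 1.36
print 0.5*9.8 / (d*1000)      # ratio to the 4.9 m fallen at the surface, about 3600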


I read somewhere that Newton first did this calculation with a value for the moon's orbital radius that was sufficiently off for him to think his result did not agree with it. Later he repeated his calculation with a more accurate value. But I can't find it now. One of these days I have to get a copy of the Principia (Amazon).

Ceva's Theorem and the altitude problem

In this post last summer, I was working on the problem of the orthocenter. It wasn't obvious to me that all three altitudes of a triangle actually cross at a single point (that they are concurrent). This turns out to be true, and that point is called the orthocenter. In the next post, I solved the problem by constructing a small triangle inscribed into a larger one, using the midpoints of the sides of the large triangle as the vertices of the small one. Remarkably, the orthocenter of the small triangle is the circumcenter of the larger one. Because we know that any three non-collinear points lie on a unique circle, which has a unique center, we deduce that the circumcenter of the large triangle (and orthocenter of the small triangle) exists and is unique.

I've tried several times to find an algebraic expression for the angles in the first figure of that post, to solve the problem directly, but didn't get anywhere. A few days ago, I came across a theorem attributed to Ceva, which makes the problem easy to solve (though not in the way I posed it). The proof is easy and the theorem is also easy to remember.



Ceva's theorem goes like this:

Pick any point P within the triangle ABC and draw lines through the vertices and P as shown in the figure. P can be any point, so the distances BX and CX are not equal but have some ratio BX/CX = x. Using the standard formula for area of a triangle, and recognizing that the heights are equal in each case, we can show that the areas (designated as |ABC|) of the triangles below are in the same ratio:

|ABX| / |ACX| = x
|BPX| / |CPX| = x

The second identity is illustrated in this diagram:



Therefore the difference is also in the same ratio:

|ABP| / |ACP| = x

As shown in this diagram:



By the same reasoning, if y = CY / AY and z = AZ / BZ, then:

|BCP| / |ABP| = y
|ACP| / |BCP| = z




Noticing that we have the same areas in both numerator and denominator we write:

      |ABP|   |BCP|   |ACP|
xyz = ----- · ----- · ----- = 1
      |ACP|   |ABP|   |BCP|

That is, if the internal lines are concurrent (all intersect at P), then:

BX   CY   AZ
-- · -- · -- = 1
CX   AY   BZ

It's easy to remember the order of terms, just walk around the triangle and fill in the ratios one by one.

The theorem also works in reverse, if this ratio is one, then the internal lines are concurrent. (The proof of this is very simple but a little subtle. See here).
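
Here is a quick numeric check of the theorem (a sketch): pick any triangle and an interior point P, find where the cevians through P meet the opposite sides, and form the product of the three ratios. The triangle and point below are arbitrary:

import numpy as np

A = np.array([0.0, 0.0])
B = np.array([4.0, 0.0])
C = np.array([1.0, 3.0])
P = np.array([1.5, 1.0])     # any interior point

def meet(p1,p2,q1,q2):
    # intersection of the line through p1,p2 with the line through q1,q2
    M = np.array([p2 - p1, q1 - q2]).T
    s,t = np.linalg.solve(M, q1 - p1)
    return p1 + s*(p2 - p1)

X = meet(A,P,B,C)            # cevian from A hits BC at X
Y = meet(B,P,C,A)            # cevian from B hits CA at Y
Z = meet(C,P,A,B)            # cevian from C hits AB at Z

d = np.linalg.norm
print (d(B-X)/d(C-X)) * (d(C-Y)/d(A-Y)) * (d(A-Z)/d(B-Z))    # 1.0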

Moving to our problem with the altitudes:



we use a little trigonometry to calculate the heights and bases. Label the angles as α, β and γ, then:

AZ = AC cos α

The ratio is constructed:

AC cos α   AB cos β   BC cos γ
-------- · -------- · --------
BC cos β   AC cos γ   AB cos α

Since all the terms cancel, the ratio is one and therefore the altitudes are concurrent!

Saturday, April 10, 2010

Chi-squared revisited


I've posted a few times about the chi-square (or -squared) distribution, for example its use to test the distribution of grades between males and females as described in Grinstead & Snell (here). I was also interested more generally in situations where this distribution comes up, for example, when multiplying two random variables that are normally distributed (here).

Probably the simplest example is rolling dice. If an n-sided die is rolled many times, the chi-square statistic computed from the counts for the n sides should follow a chi-squared distribution with df = n-1 (over many repetitions of the experiment). I decided to simulate that in Python, with the results as shown in the graphic above. It looks pretty good to me.

Here is the code. Again, I use the Counter class from here.

import random, math
from random import choice
import numpy as np
import matplotlib.pyplot as plt
import Counter
random.seed(157)

class OneDie:
    def __init__(self,nsides=6):
        self.nsides = nsides
        self.R = range(1,nsides+1)
    def nrolls(self,N=100):
        return [choice(self.R) for i in range(N)]

def chisquare_stat(N,D):
    stat = 0
    e = 1.0 * N / len(D.keys())
    for k in D:
        o = D[k]
        stat += (o-e)**2/e
    return stat

n = 6
d = OneDie(n)
L = list()
N = 100
for i in range(10000):
    C = Counter.Counter(d.nrolls(N))
    s = chisquare_stat(N,C)
    L.append(s)

# fractional bins for the histogram
width = 0.4
L = [np.floor(s/width) * width for s in L]
C = Counter.Counter(L)

ax = plt.axes()
maxCount = C.most_common(1)[0][1]
maxValue = np.max(C.keys())
ax.set_xlim(-1,maxValue+1)
ax.set_ylim(0,maxCount+5)

# go ahead and plot
for value in C:
    x = value
    r = plt.Rectangle((x,0),width=width,
                      height=C[value],
                      facecolor='green')
    ax.add_patch(r)

# chi-square should approximate gamma
@np.vectorize
def gamma_pdf(x,k=5):
    return x**(k/2.0-1) * math.e**(-x/2.0)
X = np.arange(0,maxValue,0.01)
Y = gamma_pdf(X,n-1)

plt.plot(X,600*Y,'--',
         color='r',lw=6)
plt.savefig('example.png')

Another fun proof

I've posted here repeatedly about how much I like Strang's Calculus. Here is another example of his incisive and insightful style.

The product rule is a very useful result from differential calculus:

(uv)' = u v' + v u'

A second rule that is somewhat harder to remember applies to quotients:

(u/v)' = (v u' - u v')/v²

Strang has a beautiful proof of the quotient rule.
He starts by deriving a third one---the reciprocal rule. Start with this identity:

v (1/v) = 1

Use the product rule:

d/dx [ v (1/v) ] = 0
v (1/v)' + (1/v) v' = 0
(1/v)' = -v'/v²

Now use the reciprocal rule and the product rule to derive the quotient rule:

d/dx(u/v) = (1/v) du/dx + u d/dx(1/v)
(u/v)' = (1/v) u' + u (1/v)'
(u/v)' = (1/v) u' - u v'/v²
       = (v u' - u v')/v²

Very pretty!
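
A quick symbolic check of the same result (a sketch, assuming sympy is available):

from sympy import symbols, Function, simplify

x = symbols('x')
u = Function('u')(x)
v = Function('v')(x)

lhs = (u/v).diff(x)
rhs = (v*u.diff(x) - u*v.diff(x))/v**2
print simplify(lhs - rhs)    # prints 0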

UPDATE: David Jerison uses the chain rule and the power rule to derive the reciprocal rule:

(1/v)' = (v⁻¹)' = -v⁻² v'

Thursday, April 8, 2010

Sums of sines and cosines

I just watched a couple of videos from the beginning of MIT's ocw course on Calculus. Prof. David Jerison is really excellent, a worthy successor to Gilbert Strang. He derives the formulas for the derivative of sine and cosine, which we touched on here and here.

The derivation depends on the sum rules for sine and cosine:

sin(s+t) = sin s cos t + cos s sin t
cos(s+t) = cos s cos t - sin s sin t

It occurred to me that I haven't seen these derived before (or don't remember), so I went to look it up in Strang's Calculus. I also found a beautiful graphical proof on the web here. And I found yet another pretty method due to Euler. So I thought it would be a nice topic for a post.

First, Strang's approach. The figure shows two angles, s and t (s is implicit), plus the difference between them, s-t. We use Pythagoras (of course) to calculate the distance between the two points on the unit circle (x1,y1) and (x2,y2).



x1 = cos s
x2 = cos t
y1 = sin s
y2 = sin t

The square of the distance between them is:

d² = (cos s - cos t)² + (sin s - sin t)²
   = cos² s - 2 cos s cos t + cos² t
     + sin² s - 2 sin s sin t + sin² t

Adding the sin² + cos² terms:

   = 2 - 2 cos s cos t - 2 sin s sin t

Now, look at the same figure but consider the triangle on the right-hand side formed by the angle s-t.


Imagine that we rotate this triangle so it rests on the x-axis. The two points are now:

x1 = cos (s-t)
x2 = 1
y1 = sin (s-t)
y2 = 0

The square of the distance between them is:

d² = (cos (s-t) - 1)² + (sin (s-t))²
   = cos² (s-t) - 2 cos (s-t) + 1 + sin² (s-t)
   = 2 - 2 cos (s-t)

But the two calculations are for the same distance, so we can equate them, and after canceling the 2 and multiplying by -1/2, we have

cos(s-t) = cos s cos t + sin s sin t

To get the formula for the sum, just substitute -t for t everywhere:

cos(s+t) = cos(s - -t)
= cos s cos(-t) + sin s sin(-t)
= cos s cos t - sin s sin t

To get the formula for the sin, remember that sin s = cos (90-s)

sin(s+t) = cos(90-s - t)
= cos(90-s) cos t + sin(90-s) sin t
= sin s cos t + cos s sin t



The geometric proof starts with the figure shown below.



We start with a right triangle with angle s, and inscribe the triangle formed by the angle s+t. The figure is scaled so that the hypotenuse of this second triangle is 1. Then, draw another line perpendicular to the hypotenuse of the first triangle, as shown. Label the complement to s+t as u.


Considering the triangle containing s+t, we note that u is the complement for s+t, and considering the large triangle, then u is the complement for the angle immediately adjacent plus s, so that the adjacent angle is equal to t as shown in the figure.
Simple trigonometry allows us to calculate the lengths of three sides as shown in the figure below.



Consider the segment labeled x:

sin t / x = tan s
= sin s / cos s

So

x = sin t cos s / sin s


Finally, using the large triangle:

        sin(s+t)
sin s = ---------
        cos t + x

sin(s+t) = sin s cos t + sin s x
= sin s cos t + sin t cos s


Very pretty!


The last method uses Euler's formula:


cos x + i sin x = e^(ix)

cos(s+t) + i sin(s+t) = e^(i(s+t))
  = e^(is) e^(it)
  = (cos s + i sin s)(cos t + i sin t)
  = (cos s cos t - sin s sin t)
    + i(sin s cos t + sin t cos s)


Now, for the magic. We have two complex numbers which are equal, therefore both the real and imaginary parts must be equal! Thus:


cos(s+t)   = cos s cos t - sin s sin t

i sin(s+t) = i(sin s cos t + cos s sin t)
sin(s+t) = sin s cos t + cos s sin t


It's two-for-one here.
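
A quick numerical check ties the Euler version to both identities (a sketch; the angles are arbitrary):

from cmath import exp
from math import sin, cos

s, t = 0.7, 0.4                                 # any two angles, in radians
z = exp(1j*s) * exp(1j*t)                       # e^(is) e^(it) = e^(i(s+t))
print z.real, cos(s)*cos(t) - sin(s)*sin(t)     # cos(s+t), twice
print z.imag, sin(s)*cos(t) + cos(s)*sin(t)     # sin(s+t), twice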

Thursday, April 1, 2010

Visualizing UPGMA clustering


This post is a continuing exploration of UPGMA from a post the other day, using Python. To make it easy to visualize, the data are points in two dimensions, and the distances are euclidean.

The example code contains a Point class with a weight attribute, which is the number of points "contained" within it. The primary data points have weight 1, the combination of two primary points has weight 2, and so on.

The cluster function finds the two points that are the closest together, and averages them together, using the weights.

The plot is shown above with the primary data points in blue. I added the arrows afterward to indicate which points have been averaged. You can clearly see the effect of weighting on points 'H' and 'I'.

One thing I'm puzzling about is the conflict between the weighting that I'm doing (as described by Felsenstein) and the U in the name UPGMA. I'll have to look into this to check if I'm really doing it correctly.

[UPDATE: Professor Felsenstein has an extensive comment to this post and notes that the method used here is not UPGMA but another one. The problem is that I calculate the weighted average of (x,y) for a set of points and then compute the distance to the resulting virtual point. Although this is similar to the average of the distances when the targets lie roughly in the same direction, one can easily imagine a situation in which the results are wildly different (as he details).

I should modify this code to keep a list of its component points and do a real UPGMA calculation. I'll try to remember to do that. Thanks for reading!]


import string
import numpy as np
import matplotlib.pyplot as plt
UC = list(string.uppercase)

class Point:
    def __init__(self,x,y,s=None,w=None):
        self.x = float(x)
        self.y = float(y)
        if not s:
            self.name = UC.pop(0)
        else:
            self.name = s
        # weight = number of points included
        if not w:
            self.w = 1
        else:
            self.w = w
    def dist(self,b):
        ssq = (self.x-b.x)**2 + (self.y-b.y)**2
        return np.sqrt(ssq)
    def __repr__(self):
        pos = ','.join(['x = '+str(self.x),
                        ' y = '+str(self.y),
                        ' w = '+str(self.w)])
        return self.name + ' : ' + pos

def init():
    L = list()
    X = [8,18,30,30,3]
    print 'X', X, sum(X)/len(X)
    Y = [5,18,10,15,10]
    print 'Y', Y, sum(Y)/len(Y)
    for x,y in zip(X,Y):
        p = Point(x,y)
        L.append(p)
    return L

def plot(L,col='b'):
    X = [p.x for p in L]
    Y = [p.y for p in L]
    plt.scatter(X,Y,s=250,color=col)

def show(L):
    for p in L: print p
    print

def closest(L):
    dist,p1,p2 = 1e6,-1,-1
    for i in range(len(L)):
        for j in range(i):
            d = L[i].dist(L[j])
            if d < dist:
                dist,p1,p2 = d,i,j
    return p1,p2

def cluster(L):
    i,j = closest(L)
    if i > j: i,j = j,i
    p1,p2 = L[i],L[j]
    print 'cluster'
    print p1
    print p2
    print '-'*20
    rL = L[:i] + L[i+1:j] + L[j+1:]
    x = (p1.x*p1.w + p2.x*p2.w)/(p1.w+p2.w)
    y = (p1.y*p1.w + p2.y*p2.w)/(p1.w+p2.w)
    w = p1.w + p2.w
    p = Point(x,y,w=w)
    rL.append(p)
    return rL,p

pL = init()
L = pL[:]
show(L)
plot(pL)
c = 0
colors = ['red','red','magenta','maroon']

while len(L) > 1:
    L,p = cluster(L)
    plot([p],colors[c])
    c += 1
show(L)

ax = plt.axes()
ax.set_xlim(0,32)
ax.set_ylim(0,22)
plt.savefig('example.png')

A kick in the pants?

I worked up another problem from Grinstead & Snell, but it's going to be the last. I don't want to give away all the answers, although I may post about content from the later chapters as I get it figured out. The problem for this post involves the classic example of the Poisson distribution.

"In one of the first studies of the Poisson distribution, von Bortkiewicz considered the frequency of deaths from kicks in the Prussian army corps. From the study of 14 corps over a 20-year period, he obtained the data shown in [the] Table. Fit a Poisson distribution to this data and see if you think that the Poisson distribution is appropriate."



deaths    number of corps with x deaths
  0            144
  1             91
  2             32
  3             11
  4              2


We can think of the results for each corps over a single year as a Bernoulli trial with n very large and p very small and the mean near 1. We calculate the mean as (91 + 2*32 + 3*11 + 4*2) / 280 = 0.7. So λ = 0.7.

The code below prints the values for P(0), P(1) etc, as well as the calculated probability given this λ, and the expected and observed values for the number of corps with x deaths. The output is:


$ python kicks.py 
k p E O
0 0.497 139.04 144
1 0.348 97.33 91
2 0.122 34.07 32
3 0.028 7.95 11
4 0.005 1.39 2


I think the model fits the data pretty well. However, I would like to know the appropriate statistical test for the fit.
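
One natural candidate is the chi-square goodness-of-fit statistic from the chi-squared post above; here is a rough sketch (with expected counts this small, the last cells would normally be pooled before comparing to the chi-square distribution):

from math import exp, factorial

obs = [144, 91, 32, 11, 2]
lam, n = 0.7, 280
E = [n * lam**k * exp(-lam) / factorial(k) for k in range(5)]
print sum((o-e)**2/e for o,e in zip(obs,E))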


import math
import matplotlib.pyplot as plt

def poisson(m):
    def f(k):
        e = math.e**(-m)
        f = math.factorial(k)
        g = m**k
        return g*e/f
    return f

p = poisson(0.7)
data = [144,91,32,11,2]
print 'k p E O'
for i in range(5):
    print str(i).ljust(3),
    prob = p(i)
    print str(round(prob,3)).ljust(5),
    print str(round(prob*280,2)).rjust(6),
    print str(data[i]).rjust(5)

Tuesday, March 30, 2010

Genes not patentable?

In the news (New York Times) is a ruling in a case before the US District Court for the Southern District of New York (Judge Robert W. Sweet) that invalidates Myriad Genetics' patent on the BRCA1 and BRCA2 genes (article, opinion).

For everyone I know (scientists), the idea of patenting genes---particularly unmodified genes---is ludicrous. It exemplifies the nature of our age, which we might call capitalism gone wild. On the other hand, the first lawyer I talked to (who happens to be my brother) explained to me "that's the way the world works." (I'm paraphrasing). Money talks, and all that.

Here is an excerpt of the article that gives the main themes:

“It’s really quite a dramatic holding that would have the effect of invalidating many, many patents on which the biotechnology industry has invested considerable money,” said Rebecca S. Eisenberg, a professor of law at the University of Michigan who has written extensively on gene patents.

The Genomics Law Report, an Internet journal, called the decision “radical and astonishing in its sweep.” It headlined its article, “Pigs Fly.”

Although patents are not granted on things found in nature, the DNA being patented had long been considered a chemical that was isolated from, and different from, what was found in nature.

But Judge Sweet ruled that the distinguishing feature of DNA is its information content, its conveyance of the genetic code. And in that regard, he wrote, the isolated DNA “is not markedly different from native DNA as it exists in nature.”


Executive summary:

P1 says that a lot of money is at stake (implies patents are good)
P2 expresses incredulity: "Pigs Fly"
P3 summarizes Myriad's core scientific proposition

To be clear, the argument (which is complete and total bullshit) is that the DNA in cells is different than that which has been purified away from histones and other stuff, or cloned in bacteria, or amplified by PCR. It's been "transformed," presto, and is now something different enough to be patentable. Trust me, it's a stretch. Particularly to a geneticist who has spent his whole career thinking about genes as sequences. In my view, today is the day the Emperor discovered he is naked.

P4 Well, I include paragraph 4 because it illustrates for me how silly it is for the New York Times to claim to be the "paper of record." Some time ago, I emailed them to explain that it is not correct to talk about sequencing a genome as "decoding" it, or explicating its "genetic code." I wish I still had the message I received in reply. They explained that they were not interested in one scientist's or even consensus opinion about English usage. The editors felt strongly that their readers would not be comfortable with "determined the DNA sequence of X" and would like the sound of "decoded X." Accuracy be damned.

So there you have it. If you have time, read the decision. Yet another example of the intelligence to be seen on the U.S. District Courts of the nation.

Idiots

They call themselves the "paper of record":

FAA, NTSB Investigate Near Mid-Air Crash Over SF
By THE ASSOCIATED PRESS
Published: March 30, 2010

Filed at 8:21 p.m. ET

SAN FRANCISCO (AP) -- Federal investigators are looking into the weekend near collision of a commercial jet and small airplane that came within 300 feet of each other over San Francisco.
...

The closest the planes came to each other was 300 feet vertically and 1,500 feet horizontally, Gregor said.


(from the New York Times)

>>> import math
>>> math.sqrt(300**2 + 1500**2)
1529.7058540778355

Poisson and airline reservations

In the problems for Chapter 5.1, Grinstead and Snell have the following:

An airline finds that 4 percent of the passengers that make reservations on a particular flight will not show up. Consequently, their policy is to sell 100 reserved seats on a plane that has only 98 seats. Find the probability that every person who shows up for the flight will find a seat available.




Attempt #1:
Each seat that is sold is a Bernoulli trial with Prob(no-show) = 0.04.

N = 100, so the expected number of no-shows for a flight = N p = 4.
The variance is N p (1-p) = 3.84, sd = 1.96

If the number of no-shows falls more than ≈ 1 sd below the mean (so that nearly everyone shows up), there won't be a seat for someone.
I expect this to happen about 0.5·e⁻¹ of the time = 0.18.
The desired probability is 1 minus this.

The problem with this is that no-shows are not normally distributed!



Attempt #2:

If we use the Poisson approximation we have

P(k) = λ^k e^(-λ) / k!
λ = 4
P(0 no-show) = e⁻⁴ = 0.018
P(1 no-show) = 4 × 0.018 = 0.073

P(nobody gets bumped) = 1 - 0.073 - 0.018 = 1 - 0.091 = 0.909
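
The same arithmetic in Python (a sketch):

from math import exp
lam = 4.0
p0 = exp(-lam)           # P(0 no-shows)
p1 = lam * exp(-lam)     # P(1 no-show)
print 1 - p0 - p1        # about 0.908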




Attempt #3 is a simulation (Python code below.
Note the number of seats simulated is 10⁸!).
The program prints the probability that someone does get bumped (so the estimate of the answer above is 1 minus this), and the st dev for the simulated values.


$ python problem28.py 
0.087023 0.00877316767194



import numpy as np
import random

def overbooked():
    rL = list()
    for i in range(100):
        if random.random() > 0.96: pass
        else: rL.append(1)
    return sum(rL) > 98

N = 1000
L = list()
for i in range(1000):
    S = sum([overbooked() for i in range(N)])
    L.append(S*1.0/N)
A = np.array(L)
print np.mean(A), np.std(A)

Clustering with UPGMA


While there is still much more to talk about with respect to models of sequence evolution, I'm going to introduce a new topic in phylogenetics today: distance methods of building a tree. The classic example of this was "popularized" by Sokal and Sneath (according to Felsenstein, p. 161). It's called UPGMA (unweighted pair-group method with arithmetic mean). This is a clustering algorithm that uses an average linkage method, and when applied to a set of distances between objects it gives a rooted tree.

Suppose we want to compute the distance between two clusters of objects (ABC and DE). We tally the distances between each pair of items (AD, AE, BD, BE, CD, CE) and average them.


To show the method in practice, let's start with the answer---that usually makes things a lot easier. In the tree at the top, we have six species, OTUs, or just objects listed as A through F in the figure, and also the indicated tree showing the paths by which the OTUs have diverged from each other. Distance on the tree is indicated as different chunks of a unit distance. We derive the following table of distances:


     A    B    C    D    E    F
A -
B 2 -
C 4 4 -
D 6 6 6 -
E 6 6 6 4 -
F 8 8 8 8 8 -


The method is simply to choose the pair of nodes that are closest, and average them together to construct a new node, and then compute the distance of the new node from all the others. We remove the original nodes from the table.

Here, A and B are closest, so we'll form the new node AB from them. (We remember the distance from A to B was 2). The distance from AB to the other nodes is the average of the distances for A and B. This is the same for all, so we obtain:


    AB    C    D    E    F
AB -
C 4 -
D 6 6 -
E 6 6 4 -
F 8 8 8 8 -


We can build the tree corresponding to the first two lines of what we just accomplished (AB,C):



For the next step, we'll break a tie by deciding that D and E are closest, so we collapse them into a single node (remembering that the distance from D to E was 4).


    AB    C   DE    F
AB -
C 4 -
DE 6 6 -
F 8 8 8 -


Now, we need to combine AB and C. For this case, we need to weight the distances computed for the new node using the arithmetic mean of the distance to each component (ABC to DE). However, in this example, it doesn't matter, since they are all the same.


     ABC  DE   F
ABC -
DE 6 -
F 8 8 -


Finally, we group ABC with DE.


     ABCDE  F
ABCDE -
F 8 -
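
Here is the whole reduction above as a minimal Python sketch: a dictionary of pairwise distances, where each new node's distance to the others is the size-weighted average of the distances for its two components (ties, like AB-C versus D-E, may be taken in either order):

# distances read from the tree at the top
D = {('A','B'):2, ('A','C'):4, ('A','D'):6, ('A','E'):6, ('A','F'):8,
     ('B','C'):4, ('B','D'):6, ('B','E'):6, ('B','F'):8,
     ('C','D'):6, ('C','E'):6, ('C','F'):8,
     ('D','E'):4, ('D','F'):8, ('E','F'):8}
size = dict((k,1) for k in 'ABCDEF')

def d(D,p,q):
    return D[(p,q)] if (p,q) in D else D[(q,p)]

while len(size) > 1:
    p,q = min(D, key=D.get)              # closest pair
    new = p + q
    size[new] = size[p] + size[q]
    for r in size:
        if r in (p,q,new): continue
        # size-weighted average of the distances to the two components
        D[(new,r)] = (size[p]*d(D,p,r) + size[q]*d(D,q,r))/float(size[new])
    del size[p]; del size[q]             # drop the old nodes
    D = dict((k,v) for k,v in D.items() if p not in k and q not in k)
    print 'joined', p, 'and', q, 'remaining:', sorted(size)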


The method assumes a molecular clock (that the true distances between the OTUs are proportional to the observed distances between them), which is always the case for a simple clustering application, and may be a reasonable assumption for closely related species.

Felsenstein has a nice (somewhat more complicated) example based on immunological distances (Sarich 1969 PMID 5373919). The reference is from him---as for many older papers, our crappy library does not provide access, so I have not read it. An important reason to publish in "free" journals.

In the code below, the plot of hclust isn't quite working as I'd like. So, instead (looking ahead) I'll show the neighbor-joining tree from this data.



That results in an unrooted tree.

Here is the R code:


m = c(0,2,4,6,6,8,
2,0,4,6,6,8,
4,4,0,6,6,8,
6,6,6,0,4,8,
6,6,6,4,0,8,
8,8,8,8,8,0)
dim(m)=c(6,6)
colnames(m)=c(
'A','B','C','D','E','F')
rownames(m)=colnames(m)
d = as.dist(m)

plot(hclust(d, method="average"))  # average linkage = UPGMA

library(ape)
tr=nj(d)
colors=c('blue','blue','darkgreen',
'red','red','maroon')
plot(tr,cex=2,
tip.color=colors)
axisPhylo()

Duly quoted

One day the parts of the body argued about who ought to be the boss.

The brain said, “I do the thinking around here, so I should be in charge.”
The hands said, “We do all the work around here, so we ought to be in charge.”
The feet said, “If we didn’t carry the body around, nothing could get done, so we ought to be in charge.”

Every part of the body made its claim, until finally the rectum piped up, “I want to be in charge!”
read the rest
--Mark Kleiman

Jukes-Cantor (8) - using the Poisson distribution

In a previous post, I showed a derivation of the equations (from Higgs & Attwood) describing the probability of change at a sequence position as a function of time in the Jukes-Cantor model.


P_XX(t) = 1/4 + 3/4·e^(-4αt)
P_XY(t) = 1/4 - 1/4·e^(-4αt)


These were obtained by starting with this model

  t            Δt
A => [A,C,G,T] => A

and then deriving and solving the resulting differential equation. This seems to be a pretty standard treatment.

But in Chapter 11 of Felsenstein, I found an alternative approach based on the Poisson distribution, which I think is quite wonderful.

To begin, he defines rates in terms of u (= 3α), so the individual rate for a change from X to Y is u/3. He says:

The easiest way to compute this is to slightly fictionalize the model. Instead of having a rate u of change to one of the three other bases, let us imagine that we instead have a rate 4/3*u of change to a base randomly drawn from all four possibilities. This will be exactly the same process, as there then works out to be a probability of u/3 of change to each of the other three bases. We have also added a rate u/3 of change from a base to itself, which does not matter.


Now we use the Poisson distribution.

P(k) = λ^k e^(-λ) / k!
P(0) = e^(-λ)

For a branch with time t, the

probability of no events at all at a site, when the number expected to occur is 4/3*u*t, is the zero term of the Poisson distribution…the branch of time t consists then of a vast number of tiny segments of time dt, each having the small probability of 4/3*u*dt of an event.


The probability of at least one event is 1 minus the zero term:

1 - e^(-(4/3)ut)

The probability that it is a particular event (say, A to C) is:

1/4·(1 - e^(-(4/3)ut))

The probability of change to any one of the three nucleotides is then:

3/4·(1 - e^(-(4/3)ut))
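
A quick check that this agrees with the result from the earlier post (a sketch, with u = 3α and arbitrary values for α and t):

from math import exp
alpha, t = 0.02, 7.5
u = 3*alpha
P_change = 3.0/4*(1 - exp(-4.0/3*u*t))       # from the Poisson argument
print 1 - P_change                            # P(X stays X)
print 1.0/4 + 3.0/4*exp(-4*alpha*t)           # same thing, from the earlier post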

Kyte-Doolittle hydrophobicity plot in Python


I had a request the other day to compute a hydrophobicity plot for a protein. The idea (Kyte & Doolittle, PMID 7108955) is to assign a "hydrophobicity" value to each amino acid. We slide a window along a protein and sum the values in the window, then plot the result. The plot shown above suggests the presence of six long hydrophobic stretches in this protein, which could span the membrane and thus mark a transition from "inside the cell" to "outside the cell" (or the reverse).

A   1.8   C   2.5   D  -3.5   E  -3.5
F   2.8   G  -0.4   H  -3.2   I   4.5
K  -3.9   L   3.8   M   1.9   N  -3.5
P  -1.6   Q  -3.5   R  -4.5   S  -0.8
T  -0.7   V   4.2   W  -0.9   Y  -1.3

I found a page at ExPASy that will do this.

It's pretty sophisticated, with 50 or so amino acid scales to choose from (e.g. alpha-helix and beta-sheet propensities). Since it's a fun (and simple) Python problem, I downloaded their data, as shown above. Most of the code (below) involves parsing the file with the data into a Python dictionary, and reading the file with the protein sequence. The code to scan the sequence is trivial.

As an example, I chose a protein analyzed by my friend Dana Boyd and colleagues, E. coli malG (Boyd 1993 PMID 8419303). They use alkaline phosphatase fusions to check the predictions of the topology model, as shown in this figure from the paper.



The idea with the fusions is that active alkaline phosphatase requires disulfide bonds, which only form in the periplasm (outside the cytoplasm). If we attach (fuse) alkaline phosphatase coding sequence at the points indicated by the arrows, we predict the fusions made at points predicted to be external to the cytoplasm will have high activity, while the internal ones will have low activity. As you can see, the results fit the model.

My plot using matplotlib is shown above.

import string
import numpy as np
import matplotlib.pyplot as plt

D = {'Ala':'A', 'Arg':'R', 'Asn':'N', 'Asp':'D',
'Cys':'C', 'Gln':'Q', 'Glu':'E', 'Gly':'G',
'His':'H', 'Ile':'I', 'Leu':'L', 'Lys':'K',
'Met':'M', 'Phe':'F', 'Pro':'P', 'Ser':'S',
'Thr':'T', 'Trp':'W', 'Tyr':'Y', 'Val':'V'}

fn = 'hydro.data.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = data.strip().split('\n')

HD = dict()
for e in L:
    aa, rest = e.split(':')
    rest = rest.strip()
    n = rest[rest.index('.')-1:]
    n = float(n)
    if '-' in rest: n *= -1
    HD[D[aa]] = n

for i,k in enumerate(sorted(HD.keys())):
    if i and not i % 4: print
    print k, str(HD[k]).rjust(5),
print
#=============================================
fn = 'malG.seq.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = [c for c in data if c in string.lowercase]
prot = ''.join(L).upper()
#print prot[:50]
#=============================================
window = 15
delta = 2
rL = list()
for i in range(0,len(prot)-window+1,delta):
    pep = prot[i:i+window]
    L = [HD[aa] for aa in pep]
    rL.append((i, pep, sum(L)))

for e in rL:
    if e[2] > 20: print e
X = np.arange(len(rL)) * delta
Y = [e[2] for e in rL]

plt.scatter(X,Y,s=40,color='r',zorder=2)
plt.plot(X,Y,'--',lw=1.5,color='k',zorder=1)
plt.savefig('example.png')