Wednesday, May 18, 2011

Note about the sum of cosines formula

I was showing someone the derivation of the formulas for sums and differences of sines and cosines (my post here). Unfortunately, I have some trouble remembering these. The trick I used was to recall that the derivation started by analyzing the difference cos(s-t) and its particularly easy form:

cos(s-t) = cos s cos t + sin s sin t

Then it occurred to me that there is a fairly obvious point about this that should make it even clearer. Just remember that the pattern is sine sine, cosine cosine, both terms positive.
Then, supposing s = t, we have

cos²(s) + sin²(s) = 1

So, which function and for what combination of s with itself would we always get 1? Well, it's obviously the difference, which always equals zero (the sum, 2s, could be any angle). And which function always gives 1 with an argument of 0? The cosine, of course.

cos(s-t) = cos s cos t + sin s sin t

Getting to the formula for cos(s+t) just involves realizing that if we plug in u = -t we have

cos(s+u) = cos s cos(-u) + sin s sin(-u)


cos(-u) = cos(u)
sin(-u) = - sin(u)

So it's the sine term in the formula that changes sign when we add.

cos(s+u) = cos s cos u - sin s sin u

As for the sine formula, perhaps the easiest route is through Euler:

e^(is) = cos s + i sin s
e^(i(s+t)) = cos(s+t) + i sin(s+t)

e^(i(s+t)) = e^(is) e^(it)
= (cos s + i sin s) (cos t + i sin t)
= cos s cos t - sin s sin t + i (sin s cos t) + i (cos s sin t)

The real part gives us what we had before,

cos(s+t) = cos s cos t - sin s sin t

and the imaginary part is equal to the imaginary part of the sum from the previous line:

i sin(s+t) = i (sin s cos t) + i (cos s sin t)
sin(s+t) = sin s cos t + cos s sin t

In fact, maybe this is enough by itself. :)
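As a sanity check, here is a quick numeric sketch in Python (the angles 0.7 and 1.9 are arbitrary choices) confirming both sum formulas and the Euler identity they came from:

```python
import math, cmath

s, t = 0.7, 1.9  # arbitrary test angles

# cos(s+t) = cos s cos t - sin s sin t
lhs = math.cos(s + t)
rhs = math.cos(s)*math.cos(t) - math.sin(s)*math.sin(t)
assert abs(lhs - rhs) < 1e-12

# sin(s+t) = sin s cos t + cos s sin t
lhs = math.sin(s + t)
rhs = math.sin(s)*math.cos(t) + math.cos(s)*math.sin(t)
assert abs(lhs - rhs) < 1e-12

# the Euler route: e^(i(s+t)) = e^(is) e^(it)
z = cmath.exp(1j*(s + t))
w = cmath.exp(1j*s) * cmath.exp(1j*t)
assert abs(z - w) < 1e-12
print('all identities check out')
```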

Sunday, May 15, 2011

Law of sines, and cosines

Continuing with some homework, we're going to use vector algebra to prove two geometric theorems.

[This is a similar diagram to the one from last time, but with labels switched around--sorry for any confusion].

We have vectors (a, b and c) and the lengths of the corresponding sides (|a| = a, etc.); also, the angle opposite side a is labeled A and so on.

The law of sines states that the ratio of the length of each side to the sine of the angle opposite is the same:

a/sin(A) = b/sin(B) = c/sin(C)

Recall that the area of the parallelogram formed by a and b is given by the absolute value of the cross-product:

|a X b| = a b sin(C)

And the area of the triangle is one-half that. But we must obtain the same area no matter which two vectors we use to compute the cross-product, and no matter which orientation. Thus:

|c X -a| = a c sin(B)
|-b X c| = b c sin(A)

also have the same area.

a b sin(C) = a c sin(B) = b c sin(A)

This leads directly to the law of sines. Now let's relabel the triangle slightly.

If b - a looks a little funny, just consider that

a + (b - a) = b

The law of cosines states that:

c² = a² + b² - 2 a b cos(C)

where c = |b - a|.
We can obtain this simply by expanding the dot product:

(b - a) • (b - a)
= b•b - b•a - a•b + a•a
= b² + a² - 2 a b cos(C)


(b - a) • (b - a) = c²

So finally:

c² = b² + a² - 2 a b cos(C)

Using vectors makes it easy.
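Both laws can be checked numerically with a small Python sketch. The vectors a and b below are arbitrary; cross gives the z-component of the 3-D cross product, whose absolute value is the parallelogram area:

```python
import math

# two sides of a triangle as 2-D vectors from a common vertex
a = (4.0, 1.0)
b = (1.0, 3.0)

def norm(v):
    return math.hypot(v[0], v[1])

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def cross(u, v):                 # z-component of the 3-D cross product
    return u[0]*v[1] - u[1]*v[0]

c = (b[0] - a[0], b[1] - a[1])   # third side, b - a

# law of cosines: c^2 = a^2 + b^2 - 2 a b cos(C)
cosC = dot(a, b) / (norm(a) * norm(b))
lhs = norm(c)**2
rhs = norm(a)**2 + norm(b)**2 - 2*norm(a)*norm(b)*cosC
assert abs(lhs - rhs) < 1e-9

# law of sines: the sine of each angle comes from a cross product,
# since |a x b| = a b sin(C), etc., and all three ratios must agree
sinC = abs(cross(a, b)) / (norm(a) * norm(b))
sinA = abs(cross(b, c)) / (norm(b) * norm(c))
sinB = abs(cross(c, a)) / (norm(c) * norm(a))
r1 = norm(c)/sinC    # side opposite C is b - a, with length |c|
r2 = norm(a)/sinA
r3 = norm(b)/sinB
assert abs(r1 - r2) < 1e-9 and abs(r2 - r3) < 1e-9
print('both laws verified')
```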

Wednesday, May 11, 2011

Ceva using vectors--special case

Some time ago we looked at Ceva's theorem (post). I'm starting on a book about Vector Calculus, and saw this question early in Chapter 1.

Using vectors, prove that the lines from each vertex of a triangle to the midpoint of the opposite side cross at a single point. The picture is as shown below:

We solve this by constructing parametric equations for the midpoint lines. The first one starts at O and travels along the vector a + (b-a)/2, which points to the midpoint of the opposite side and is half the diagonal of the parallelogram formed by a and b. The equation is:

u[a + (b-a)/2]

where u is the parameter. Similarly, construct the line extending from A to its opposing side. It starts from A and travels along the vector b/2 - a. The equation is:

a + v(b/2 - a)

At the point where the vectors cross, these are equal:

u[a + (b-a)/2] = a + v(b/2 - a)

I had a little trouble at this point, and I must confess I peeked at the answer. We can solve this by matching the a terms and the b terms separately: since a and b are not parallel, they are linearly independent, so the coefficient of a must agree on the two sides, and likewise for b.

The b terms give:

½ u b = ½ v b
u = v

and the a terms give:

u a - ½ u a = a - v a
½ u + v = 1

Substituting u = v then gives u = v = ⅔.

Given this suggested solution u = v = ⅔, we can easily verify that

u[a + (b-a)/2]
⅔[a + (b-a)/2]
= ⅔ a + ⅓ b - ⅓ a = ⅓ (a + b)

a + v(b/2 - a)
a + ⅔(b/2 - a)
= a + ⅓ b - ⅔ a = ⅓ (a + b)

Now consider the third side. The midpoint line starts from B and travels along the vector a/2 - b. The equation is:

b + w(a/2 - b)

We observe that for w = ⅔ this becomes:

b + w(a/2 - b)
b + ⅔(a/2 - b)
= b + ⅓ a - ⅔ b = ⅓ (a + b)

Thus, all three lines intersect at the same point.
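To double-check, here is a quick numeric sketch in Python with concrete (arbitrary) choices for a and b; all three parametric lines land on (a + b)/3 at parameter ⅔:

```python
# arbitrary concrete vectors for the triangle
a = (5.0, 1.0)
b = (2.0, 4.0)

def point(p, t, d):
    # p + t*d, componentwise
    return (p[0] + t*d[0], p[1] + t*d[1])

u = v = w = 2.0/3.0
zero = (0.0, 0.0)

# line from O along a + (b-a)/2
d1 = (a[0] + (b[0]-a[0])/2, a[1] + (b[1]-a[1])/2)
p1 = point(zero, u, d1)

# line from A along b/2 - a
d2 = (b[0]/2 - a[0], b[1]/2 - a[1])
p2 = point(a, v, d2)

# line from B along a/2 - b
d3 = (a[0]/2 - b[0], a[1]/2 - b[1])
p3 = point(b, w, d3)

centroid = ((a[0]+b[0])/3, (a[1]+b[1])/3)   # (a + b)/3
for p in (p1, p2, p3):
    assert abs(p[0] - centroid[0]) < 1e-9
    assert abs(p[1] - centroid[1]) < 1e-9
print('all three lines pass through', centroid)
```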

I haven't got an extension to the general case just yet (i.e. any point, not just the midpoint).

Friday, May 6, 2011

Flag update

AM   Armenia
BD Bangladesh
CU Cuba
DO Dominican Republic
JO Jordan
KH Cambodia
LA Laos
LB Lebanon
LV Latvia
MK Macedonia
MW Malawi
NP Nepal
SY Syria
UG Uganda
UZ Uzbekistan
VG Virgin Islands (British)

Thanks for reading!

Wednesday, May 4, 2011

Intro to ANOVA

This is an introductory post on ANOVA (analysis of variance). We ask the question: given three (or more) groups of observations, is one or more of the group means significantly different from the others? We will compute an F-statistic and compare it with an F-distribution (carry out an F-test). If the statistic exceeds the 95% quantile, we will reject the null hypothesis that the means are the same.

According to wikipedia:

the two-group case can be covered by a t-test (Gosset, 1908). When there are only two means to compare, the t-test and the ANOVA F-test are equivalent; the relation between ANOVA and t is given by F = t².
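That relation is easy to verify directly; here is a pure-Python sketch with two made-up samples, using the pooled-variance form of the two-sample t statistic:

```python
# two small made-up samples
x = [243.0, 251.0, 275.0, 291.0, 347.0]
y = [206.0, 210.0, 226.0, 249.0, 255.0, 273.0]

def mean(L):
    return sum(L) / float(len(L))

def ss(L):                       # sum of squared deviations
    m = mean(L)
    return sum((v - m)**2 for v in L)

nx, ny = len(x), len(y)

# pooled-variance two-sample t statistic
sp2 = (ss(x) + ss(y)) / (nx + ny - 2)
t = (mean(x) - mean(y)) / (sp2 * (1.0/nx + 1.0/ny))**0.5

# one-way ANOVA F statistic for the same two groups
grand = mean(x + y)
SSD_X = nx*(mean(x) - grand)**2 + ny*(mean(y) - grand)**2
SSD_W = ss(x) + ss(y)
F = (SSD_X / 1) / (SSD_W / (nx + ny - 2))

assert abs(F - t**2) < 1e-9
print('F = t^2 holds:', round(F, 4))
```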

ANOVA is a versatile (and complex) set of methods. This is just an elementary application, where we'll use the R implementation on three simple groups of data, and then compute the result ourselves in Python to see how it works internally.

To begin with, we follow the simple example from Dalgaard. You will need R and the ISwR package (or just construct the "data frame" yourself). R code:

> library(package=ISwR)
> data(red.cell.folate)
> summary(red.cell.folate)
     folate          ventilation
 Min.   :206.0   N2O+O2,24h:8
 1st Qu.:249.5   N2O+O2,op :9
 Median :274.0   O2,24h    :5
 Mean   :283.2
 3rd Qu.:305.5
 Max.   :392.0
> red.cell.folate
folate ventilation
1 243 N2O+O2,24h
2 251 N2O+O2,24h
3 275 N2O+O2,24h
4 291 N2O+O2,24h
5 347 N2O+O2,24h
6 354 N2O+O2,24h
7 380 N2O+O2,24h
8 392 N2O+O2,24h
9 206 N2O+O2,op
10 210 N2O+O2,op
11 226 N2O+O2,op
12 249 N2O+O2,op
13 255 N2O+O2,op
14 273 N2O+O2,op
15 285 N2O+O2,op
16 295 N2O+O2,op
17 309 N2O+O2,op
18 241 O2,24h
19 258 O2,24h
20 270 O2,24h
21 293 O2,24h
22 328 O2,24h

We have 22 values in 3 groups.

> attach(red.cell.folate)
> anova(lm(folate~ventilation))
Analysis of Variance Table

Response: folate
            Df Sum Sq Mean Sq F value  Pr(>F)
ventilation  2  15516  7757.9  3.7113 0.04359 *
Residuals   19  39716  2090.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The attach gives us access to the names of the columns of values (folate) and factors (ventilation). We make a plot (shown at the top of the post):

> plot( folate ~ ventilation, data = red.cell.folate )
> stripchart(x=folate~ventilation)

The value 0.04359 indicates that we have P < 0.05.

We write the data to disk, and remember that the groups have (respectively) 8, 9 and 5 values.

> setwd('Desktop')
> write.table(folate,'data.txt',row.names=F,col.names=F)

We use the Python script listed at the end of the post to compute the F-statistic; its output is:

MS_W 2090.32
MS_X 7757.88
F_stat 3.71

If you look in the R output above, you'll see the same values as given here for MS_W and MS_X and the F-statistic.

To get the right F-distribution, we need to know that the degrees of freedom are:

k-1 = 2  # k = number of groups
N-k = 19 # N = total observations

Since 3.71 is just higher than the 95% quantile of this F-distribution (about 3.52), we can reject the null hypothesis H0.

I found a calculator online. You can see the results in the screenshot.

The underlying calculation is pretty simple. We compute sumsq, the sum of squared differences from the mean, for several sets of values and the relevant means.

For the within-groups comparison, in mathematical notation this is (i indexes the groups, j the observations within group i):

Σ_i Σ_j (x_ij - m_i)²

In Python, for each group we compute sumsq relative to the group mean, and add the results for all three groups to give SSD_W.

To carry out the between groups comparisons, we first compute the grand mean m (of all of the samples). Then for each group we compute:

Σ_i Σ_j (m_i - m)²

Since the squared value is the same for every observation within a group, this is equivalent to (with n_i the number of observations in group i):

Σ_i n_i (m_i - m)²
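This collapse of the inner sum is easy to check numerically; a quick sketch with made-up groups:

```python
# three small made-up groups
groups = [[243, 251, 275], [206, 210, 226, 249], [241, 258]]

def mean(L):
    return sum(L) / float(len(L))

all_values = [x for g in groups for x in g]
m = mean(all_values)                      # grand mean

# double sum: (m_i - m)^2 added once per observation j
double_sum = sum(sum((mean(g) - m)**2 for _ in g) for g in groups)

# collapsed form: n_i * (m_i - m)^2 per group
weighted_sum = sum(len(g)*(mean(g) - m)**2 for g in groups)

assert abs(double_sum - weighted_sum) < 1e-9
print('the two forms agree')
```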

In the Python code this becomes:

SSD_X += len(g)*(mean(g)-m)**2

and we sum these over all the groups to give SSD_X. This is a weighted sum of squared deviations of the group means from the grand mean.

Finally, we compute:

MS_W = SSD_W/(N-k)
MS_X = SSD_X/(k-1)
F_stat = MS_X/MS_W

As to why we do this, for now you will have to go read the article. The explanation in Dalgaard is particularly clear, indeed, the whole book is excellent.

There is a SciPy function to carry out ANOVA (stats.f_oneway), but I don't have SciPy installed right now at home, and this post is long enough already. That's for another day.

Python code:

fn = 'data.txt'
FH = open(fn,'r')
data = FH.read().split()
FH.close()

data = [int(n) for n in data]
A = data[:8]
B = data[8:17]
C = data[17:]

#A = [243,251,275,291,347,354,380,392]
#B = [206,210,226,249,255,273,285,295,309]
#C = [241,258,270,293,328]

def mean(L):
    return sum(L)*1.0/len(L)

def sumsq(L):
    m = mean(L)
    print 'sumsq'
    rL = [(x-m)**2 for x in L]
    for n,p in zip(L,rL):
        print n, round(p,1)
    S = sum(rL)
    print 'total', round(S,1), '\n'
    return S

def ANOVA(G):
    # variation within groups
    SSD_W = 0
    for g in G:
        SSD_W += sumsq(g)

    # a bit awkward, just flattening the list of lists
    # to get the grand mean and N
    T = list()
    for g in G:
        T.extend(g)
    m = mean(T)

    # variation between groups (X for 'cross')
    SSD_X = 0
    for g in G:
        SSD_X += len(g)*(mean(g)-m)**2

    N = len(T) # 22
    k = len(G) # 3
    MS_W = SSD_W*1.0/(N-k)
    MS_X = SSD_X*1.0/(k-1)
    F_stat = MS_X/MS_W
    return MS_W, MS_X, F_stat

MS_W, MS_X, F_stat = ANOVA([A,B,C])
print 'MS_W', round(MS_W,2)
print 'MS_X', round(MS_X,2)
print 'F_stat', round(F_stat,2)