Sunday, February 28, 2010

Jukes-Cantor (7)

I wrote a little script that derives the eigenvalues and eigenvectors for a simple Q matrix, and then tests whether U Λ U-1 = Q, as we said last time.

It looks good, except that I haven't been able to actually do the exponentiation part yet.

import numpy as np
import math
a = 0.00333333
Q = np.array(
[[ -3*a, a, a, a],
[ a, -3*a, a, a],
[ a, a, -3*a, a],
[ a, a, a, -3*a]])

evals, evecs = np.linalg.eig(Q)
U = evecs
L = np.diag(evals)
Ui = np.linalg.inv(U)
temp =,L)
P =,Ui)

names = ['Q','evals','evecs','U','L','Ui','P']
for i,v in enumerate([Q,evals,evecs,U,L,Ui,P]):
print names[i]
for row in v:
temp = [str(round(e,3)).rjust(5) for e in row]
print ' '.join(temp)
print row

-0.01 0.003 0.003 0.003
0.003 -0.01 0.003 0.003
0.003 0.003 -0.01 0.003
0.003 0.003 0.003 -0.01


-0.866 0.5 0.269 0.024
0.289 0.5 0.624 0.314
0.289 0.5 -0.182 -0.818
0.289 0.5 -0.711 0.481

-0.866 0.5 0.269 0.024
0.289 0.5 0.624 0.314
0.289 0.5 -0.182 -0.818
0.289 0.5 -0.711 0.481

-0.013 0.0 0.0 0.0
0.0 -0.0 0.0 0.0
0.0 0.0 -0.013 0.0
0.0 0.0 0.0 -0.013

-0.866 0.542 0.235 0.089
0.5 0.5 0.5 0.5
0.0 0.789 -0.102 -0.688
0.0 0.322 -0.811 0.489

-0.01 0.003 0.003 0.003
0.003 -0.01 0.003 0.003
0.003 0.003 -0.01 0.003
0.003 0.003 0.003 -0.01

Jukes-Cantor (6)

I'm exploring the derivation of equations for the probability that a nucleotide either stays the same over a period of evolutionary time, or changes to be one of the other three. The Jukes-Cantor model is that all of these rates (X to Y) are the same, described by a parameter α that is the instantaneous rate.

One of my sources is Ziheng Yang's Computational Molecular Evolution. According to him (and wikipedia), this model can be represented by a substitution-rate matrix which is usually called Q:

     A     C     G     T
A -3λ λ λ λ
C λ -3λ λ λ
G λ λ -3λ λ
T λ λ λ -3λ

Yang uses λ in place of the often-used α. For this post we'll follow along with him.

As I posted about the other day, we need to have equations that will give us the probabilities for any time t. These should be recognizable:

PXX(t) = 1/4 + 3/4*e-4*λ*t
PXY(t) = 1/4 - 1/4*e-4*λ*t

Yang calls these p0(t) and p1(t).

The corresponding transition-probability matrix is designated P(t) and it is:


p0(t) p1(t) p1(t) p1(t)
p1(t) p0(t) p1(t) p1(t)
p1(t) p1(t) p0(t) p1(t)
p1(t) p1(t) p1(t) p0(t)

And if we had P for one unit of evolutionary time, then we could do matrix multiplication (see here) to generate P for any time.

Now, Yang (and other sources) also says that:

P(t) = eQ(t)

And I've been trying to wrap my head around that, i.e. matrix exponentiation. Luckily, I've been getting into Gilbert Strang's MIT lectures on linear algebra (here). So I can appreciate what the inverse of a matrix is, such that:

Q Q-1 = I

where I is just the identity matrix of the same size and shape as Q. Now, suppose we can diagonalize Q. That is, we can find an invertible matrix U and a diagonal matrix Λ such that:

Q = U Λ U-1

A diagonal matrix means that the matrix has non-zero values only on the diagonal, e.g.:

λ1    0    0    0
0 λ2 0 0
0 0 λ3 0
0 0 0 λ4

And that means that Λ Λ is simply:

λ12    0    0    0
0 λ22 0 0
0 0 λ32 0
0 0 0 λ42

Look what happens when we do something like:

Q Q = U Λ U-1 U Λ U-1
= U Λ2 U-1

We only have to deal with Λ, and it's straightforward. Now for the real magic. I am reliably informed that this trick works for any function. For example:

eQ(t) = U e&Lambda(t) U-1

where we have:

e&Lambda(t) =

eλ1 0 0 0
0 eλ2 0 0
0 0 eλ3 0
0 0 0 eλ4

And, most impressive of all, U and U-1 are the right and left eigenvectors of Q, and the coefficients of Λ are its eigenvalues! We've looked at that in detail previously(here).

Jukes-Cantor (5)

I am trying to see how the equations in the Jukes-Cantor model of sequence evolution work, and then eventually, extend this to other models. In order to test my understanding, I'll want to work out some practical examples. But I'm not there yet.

What I want to do here is to wrap up something from the first post. There we had two differential equations for the rate of change of a particular nucleotide position:

d/dt(PXX(t)) =  -3*α*e-4αt
d/dt(PXY(t)) = α*e-4αt

And we'd like to express these results in terms of PXX(t) and PXY(t):

PXX(t) = 1/4 + 3/4*e-4αt
PXY(t) = 1/4 - 1/4*e-4αt

Taking the first one, we have

PXX(t) = 1/4 + 3/4*e-4αt
3*e-4αt = 4*(PXX(t) - 1/4)
-3*α*e-4αt = -4*α*(PXX(t) - 1/4)
d/dt(PXX(t)) = α - 4*α*PXX(t)

And for the second

PXY(t) = 1/4 - 1/4*e-4αt
e-4αt = 1 - 4*PXY(t))
α*e-4αt = α - 4*α*PXY(t))
d/dt(PXY(t)) = α - 4*α*PXY(t)

So the slopes are proportional to the probabilities, with an extra term. But the most interesting thing is that the form is the same for both PXX and PXY!

I wasn't expecting this but it makes sense, because at long times we come to equilibrium (the stationary distribution of the Markov chain), and all rates are the same. At time-zero we have PXX = 1 and the rate is -3*α, while PXY = 0 and the rate is α. I think it's OK.

Friday, February 26, 2010

Jukes-Cantor (4)

If we think of evolutionary changes in sequences in a discrete fashion, then we could pick an α that would give, in one unit of evolutionary time, a 1% chance that nucleotide X => nucleotide Y. We can model this by matrix multiplication. We set up an initial Identity matrix and then process it to give the rate matrix T.

     A       C       G       T
A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

A starts out as equal to I, but repeated rounds of T•A yield the results below.

This can easily be modified to change the sequence model, although I haven't shown it here.

However, I notice that there is an inconsistency between this model and the simulation of mutation from here. The simulation showed saturation around 200% mutation, whereas this matrix method is clearly saturated at 100 rounds of 1%. I believe we may be missing a factor of 2, but I don't have any idea how that would be. I will think about it in days to come, but if you can spot the answer please let me know.

[ UPDATE: <whacks self on head> The previous simulation was for 1% mutation in each round, while this one has 3% in each round (1% for each nucleotide). That's the source of the difference! ]

A 1.0 0.0 0.0 0.0
C 0.0 1.0 0.0 0.0
G 0.0 0.0 1.0 0.0
T 0.0 0.0 0.0 1.0

A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

A 0.9412 0.0196 0.0196 0.0196
C 0.0196 0.9412 0.0196 0.0196
G 0.0196 0.0196 0.9412 0.0196
T 0.0196 0.0196 0.0196 0.9412

import numpy as np
nt = 'ACGT'
def show(A,x):
print x
print ' ' + ' '.join(list(nt))
for i,e in enumerate(A):
print nt[i] + ' ',
L = [str(round(n,4)).ljust(7) for n in e]
print ' '.join(L)

I = np.eye(4)
A = I
T = I * 0.96 + 0.01
for i in range(1,101):
A =,A)
R = range(2,10,2) + range(20,101,10)
if i in [1] + R:

A 1.0 0.0 0.0 0.0
C 0.0 1.0 0.0 0.0
G 0.0 0.0 1.0 0.0
T 0.0 0.0 0.0 1.0

A 0.97 0.01 0.01 0.01
C 0.01 0.97 0.01 0.01
G 0.01 0.01 0.97 0.01
T 0.01 0.01 0.01 0.97

A 0.9412 0.0196 0.0196 0.0196
C 0.0196 0.9412 0.0196 0.0196
G 0.0196 0.0196 0.9412 0.0196
T 0.0196 0.0196 0.0196 0.9412

A 0.887 0.0377 0.0377 0.0377
C 0.0377 0.887 0.0377 0.0377
G 0.0377 0.0377 0.887 0.0377
T 0.0377 0.0377 0.0377 0.887

A 0.8371 0.0543 0.0543 0.0543
C 0.0543 0.8371 0.0543 0.0543
G 0.0543 0.0543 0.8371 0.0543
T 0.0543 0.0543 0.0543 0.8371

A 0.791 0.0697 0.0697 0.0697
C 0.0697 0.791 0.0697 0.0697
G 0.0697 0.0697 0.791 0.0697
T 0.0697 0.0697 0.0697 0.791

A 0.5815 0.1395 0.1395 0.1395
C 0.1395 0.5815 0.1395 0.1395
G 0.1395 0.1395 0.5815 0.1395
T 0.1395 0.1395 0.1395 0.5815

A 0.4704 0.1765 0.1765 0.1765
C 0.1765 0.4704 0.1765 0.1765
G 0.1765 0.1765 0.4704 0.1765
T 0.1765 0.1765 0.1765 0.4704

A 0.3965 0.2012 0.2012 0.2012
C 0.2012 0.3965 0.2012 0.2012
G 0.2012 0.2012 0.3965 0.2012
T 0.2012 0.2012 0.2012 0.3965

A 0.3474 0.2175 0.2175 0.2175
C 0.2175 0.3474 0.2175 0.2175
G 0.2175 0.2175 0.3474 0.2175
T 0.2175 0.2175 0.2175 0.3474

A 0.3148 0.2284 0.2284 0.2284
C 0.2284 0.3148 0.2284 0.2284
G 0.2284 0.2284 0.3148 0.2284
T 0.2284 0.2284 0.2284 0.3148

A 0.2931 0.2356 0.2356 0.2356
C 0.2356 0.2931 0.2356 0.2356
G 0.2356 0.2356 0.2931 0.2356
T 0.2356 0.2356 0.2356 0.2931

A 0.2786 0.2405 0.2405 0.2405
C 0.2405 0.2786 0.2405 0.2405
G 0.2405 0.2405 0.2786 0.2405
T 0.2405 0.2405 0.2405 0.2786

A 0.269 0.2437 0.2437 0.2437
C 0.2437 0.269 0.2437 0.2437
G 0.2437 0.2437 0.269 0.2437
T 0.2437 0.2437 0.2437 0.269

A 0.2627 0.2458 0.2458 0.2458
C 0.2458 0.2627 0.2458 0.2458
G 0.2458 0.2458 0.2627 0.2458
T 0.2458 0.2458 0.2458 0.2627

Jukes-Cantor (3)

From last time, we have two equations for sequences changing according to Jukes-Cantor:

PXX(t) = 1/4 + 3/4*e-4*α*t
PXY(t) = 1/4 - 1/4*e-4*α*t

We can look at this as the probability that a single site will change in this way (from X to Y) over time, but we can also look at it as the fraction of a collection of sites that will change. Since Y can be any one of three nucleotides, the total fraction of sites that differ between the ancestral sequence and a present-day descendant sequence is three times PXY(t) or:

PXN(t) = 3/4 - 3/4*e-4*α*t

Two present-day homologs (common ancestor) have effectively evolved for twice the time because there are two stretches of evolution of time t. The proportion of sites that differ is:

p = 3/4 - 3/4*e-8*α*t

The above equation is what we observe when we look at the sequences. However, our estimate of the true distance, or actual number of substitutions per site:

d = 2*t * 3*α = 6*α*t

We usually do not know either α or t individually, but we can say that:

α*t = d/6
p = 3/4 - 3/4*e-8*α*t
4/3*p = 1 - e-8/6*d
d = -3/4 * ln(1 - 4/3*p)

This is what we've been after. These equations relate the actual evolutionary distance to the observed changes and vice-versa.

p = proportion or fraction of sites that are observed to be different
d = distance or actual number of substitutions per site

Their relationship is plotted at the top.

Plot code:

from math import e
import numpy as np
import matplotlib.pyplot as plt

def d(at):
return 6*at

def p(at):
return -0.75*e**(-8*at) + 0.75

at = A = np.arange(0,2.001,0.001)

ax = plt.axes()
ax.set_xlim(0, 1.0)
ax.set_ylim(0, 2.0)

Jukes-Cantor (2)

The figure illustrates at least part of the reason that we need models of sequence evolution. It comes from a very nice book by Page & Holmes.

What I want to do here is to follow the derivation of the equations for PXX and PXY as a function of time, as developed in Higgs & Atwood. This isn't really necessary from a mathematical viewpoint, since we have already guessed the equations, but it's a fun argument.

Consider the following path: we start with an A at some position at time-zero, and after time t + Δt we observe that it is still A, but realize that at a short time prior to the second observation it might have been any nucleotide (since we weren't looking then):

  t            Δt
A => [A,C,G,T] => A

There are four possible paths to get from A to A, through each of the possible intermediates. We sum over the probabilities... We have:

PAA(t + Δt) = α*Δt*(PAC(t))
+ α*Δt*(PAG(t))
+ α*Δt*(PAT(t))
+ (1 - 3*α*Δt)*(PAA(t))

We can expand the last term to:

PAA(t) - 3*α*Δt*PAA(t)

A wee bit o'calculus. We want to know PAA(t + Δt). Since Δt is small, we can take the value of the function at t and correct it by adding the slope of the function (at t) times Δt. That is:

PAA(t + Δt) = PAA(t) + d/dt PAA(t) * Δt

So we substitute this expression for the left-hand side of the first equation and then notice that we can subtract PAA(t) from both sides, leaving:

d/dt PAA(t) * Δt = α*Δt*(PAC(t) + PAG(t) + PAT(t)) - 3*α*ΔtPAA(t)

Since Δt occurs in each term on both sides it cancels (which is really the whole point of this). Also the sum of the three PAX(t) terms is equal to 1 - PAA(t), and so we have:

d/dt PAA(t) = α*(1 - PAA(t)) - 3*α*PAA(t)
= α - 4*αPAA(t)

The rate of change of PAA(t) is proportional to PAA(t), which is pretty obvious when you think about it, and so the form of the equation is an exponential:

PAA(t) = A*e-4*α*t + B

We need the -4α in the exponent, so that it will come out front when we take the derivative (see here).

We evaluate the constants A and B by considering the boundary conditions, namely, at long times PAA(t) = 1/4, so B = 1/4; and PAA(0) = 1, so A + B = 1 and A = 3/4.

PAA(t) = 1/4 + 3/4*e-4*α*t

Since the other three PAX(t) are all equal and also equal to 1 - PAA(t), we have:

3 * PAX(t) = 1 - PAA(t)
= 3/4 - 3/4*e-4*α*t
PAX(t) = 1/4 - 1/4*e-4*α*t

which is just what we said the other day!

Jukes-Cantor (1)

It is widely reported that Jukes and Cantor proposed the first model for sequence evolution (often designated JC69)1. If the probability that a given nucleotide, say A, will change to another nucleotide, T, in a short time dt, is α * dt, then the same rate constant applies to all 12 possible nucleotide interconversions. I say widely reported because the original reference is in a book that is hard to come by, and I would guess that at least some of the people who cite it have not actually seen a copy. If you have an e-copy (doubtful anyone does), I would love for you to send it to me.

We can set this up as a matrix like so:

     A     C     G     T
A -3α α α α
C α -3α α α
G α α -3α α
T α α α -3α

These are instantaneous rates of change (and only work for short times). We would like to get an equation where the probability of change is expressed as a function of time, so that it could be calculated at any time. In fact, that is what I want to do on the blog, to go through the derivation in Higgs & Attwood. But, for the first post, let's start by peeking at the answer:

Consider two generic nucleotides X and Y. Then the formulae are:

PXX(t) = 1/4 + 3/4*e-4αt
PXY(t) = 1/4 - 1/4*e-4αt

We can see that these make intuitive sense, by considering the results they generate for time-zero and also for very long times. At time-zero the exponential term equals one, so we have the probability that the nucleotide X has not changed after zero time is, of course equal to one. At very long times, the second term of both equations becomes zero, and they both become equal to 1/4. At long times, no matter what we started with, we have equal probability of each of four nucleotides.

And finally, we see that at any intermediate time, no matter what the value of e-4αt, since the total of the three PXY terms with e cancels the single PXY term, the total probability is always equal to one, as required.

We can also verify that this equation gives (something very close to) the required instantaneous rate upon differentiation. Remembering that d/dt(ekt) = k*ekt:

d/dt(PXX(t)) =  -3*α*e-4αt
d/dt(PXY(t)) = α*e-4αt

I'm missing the final step. I think what I have to do is compute P(t + dt) - P(t) and that'll be equal to these slopes. I'll report back or if anyone can help me out here, please do.

1. Jukes, TH and Cantor, CR. 1969. Evolution of protein molecules. Pp. 21-123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

Code for the plot:

from math import e
import numpy as np
import matplotlib.pyplot as plt

def pXX(at):
return e**(-4*at) * 0.75 + 0.25

def pXY(at):
return -e**(-4*at) * 0.25 + 0.25

at = A = np.arange(0,2.001,0.001)
plt.plot(at, pXX(at),lw=3,color='r')
plt.plot(at, pXY(at),lw=3,color='b')

ax = plt.axes()
ax.set_xlim(0, 2.0)
ax.set_ylim(0, 1.0)

Why random mutation does not follow the Poisson?

In the simulation of sequence evolution from last time, there is a connection with the Poisson distribution that was also a subject of recent posts (here and here).

Consider an individual position in the sequence, say the first nucleotide. If we ask the question, what is the probability that the initial T has changed to another nucleotide after some number of trials, you can see that this should be described by the Poisson distribution. We have many cycles, and for an individual cycle the probability that the first nucleotide will change is very low (p = 0.01); after 100 cycles, on the average, the mean number of changes is 1. Therefore, we expect the number of unchanged nucleotides to be:

P(k) = λk / k! * e
P(k=0) = λ0 / 0! * e
P(k=0) = e= e-1 = 0.368

So the Hamming distance should be 1 minus 1/e, which is 0.632. But inspection of the plot, or printing out the values for each run:

   5%   0.047  A 252, C 251, G 248, T 249
10% 0.09 A 256, C 246, G 254, T 244
50% 0.366 A 246, C 260, G 248, T 246
100% 0.561 A 254, C 270, G 251, T 225
200% 0.713 A 241, C 280, G 245, T 234
5% 0.05 A 250, C 247, G 248, T 255
10% 0.097 A 245, C 252, G 252, T 251
50% 0.351 A 235, C 270, G 245, T 250
100% 0.557 A 251, C 249, G 258, T 242
200% 0.696 A 228, C 242, G 301, T 229
5% 0.047 A 252, C 243, G 252, T 253
10% 0.094 A 248, C 240, G 258, T 254
50% 0.342 A 262, C 240, G 241, T 257
100% 0.538 A 257, C 239, G 257, T 247
200% 0.692 A 257, C 246, G 258, T 239

shows that the actual value is more like 0.56.

What's going on?

As the number of mutations increases, there is an increase in the probability of targeting a position that was already mutated in a previous round. This is what we were saying before, that the non-linearity is due to this effect. Now we need to recognize that back mutation can occur, restoring the original nucleotide. This depresses the mutation frequency below that predicted from the Poisson distribution. To see that this explanation is correct, we can modify the code to mark the changed bases as lower case.

def JC_rates():
#return { 'A':'CGT', 'C':'AGT','G':'ACT', 'T':'ACG' }
return { 'A':'cgt', 'C':'agt','G':'act', 'T':'acg',
'a':'cgt', 'c':'agt','g':'act', 't':'acg' }

With this single difference, we now obtain the values predicted by the Poisson.

 100%   0.636  A 88, C 93, G 87, T 96
100% 0.634 A 91, C 97, G 86, T 92
100% 0.629 A 95, C 96, G 95, T 85

How about that?

Wednesday, February 24, 2010

Models of sequence evolution (1): a simulation

I want to spend some time working on concepts in phylogenetics, and I'm going to start with models of sequence evolution. But before I do that I set up a simulation to visualize the broad outlines of what's going on. So, here is a basic simulation of sequence changes. The script sets up a sequence of length 1000 (50% GC) and then the sequence undergoes repeated rounds of mutation with a mutation rate of 1% for each round. At each step, we calculate the Hamming distance (the number of changes required to turn the evolved sequence back into the original), and convert it to a fraction of the total sequence length.

I plotted the results from 3 independent runs above. (They have been offset slightly for clarity).

The general shape of the curves shows saturation, as expected. Each plot is approximately linear at low levels of mutagenesis: ten cycles of mutagenesis at 1% per cycle gives a Hamming distance of ≈ 0.1. But by 50 cycles the distance is only about 0.35 and by 100 cycles it is about 0.55. This is a simple result of the fact that in any mutation event, sometimes the target has been mutated before, so it takes a long time before all the targets have been hit. Eventually, we do, and the Hamming distance peaks at 75% of the length, since in this example the model of sequence evolution is to change a nucleotide into one of the other three at equal rates.

The script can be modified to alter the initial GC content or the rates, although it hasn't been made particularly easy to do this. I did some runs with modified conditions and observe that: (i) a GC-content far from 50% is ameliorated quite quickly and (ii) even massively unbalanced rates (e.g. rD['T'] = 'ACG' + 'G' * 100) have modest effects on the nucleotide composition. That conflicts with what I think I remember of the theory, so I'll be interested to return to the issue after some reading.

Here's the code:

import random, sys
import numpy as np
import matplotlib.pyplot as plt
nt = 'ACGT'

def init(GC = 50):
N = 5 # 1000 total bases
seqL = list('CG') * int(N * GC)
seqL += list('AT') * int(N * (100-GC))
return seqL

def JC_rates():
return { 'A':'CGT', 'C':'AGT',
'G':'ACT', 'T':'ACG' }

def mutagenize(Lseq,mrate):
rD = JC_rates()
N = int(mrate / 100.0 * len(Lseq))
X = len(Lseq)
for i in range(N):
j = random.choice(range(X))
nt = Lseq[j]
Lseq[j] = random.choice(rD[nt])

def hamming(s1,s2):
L = [1 for c1,c2 in zip(s1,s2) if c1 != c2]
return sum(L)*1.0/len(s1)

def report(seqL):
pL = [n + ' ' + str(seqL.count(n)) for n in nt]
return ', '.join(pL)

def simulate(seq,N):
L = seq[:]
mrate = 1
Y = list()
R = [5,10,50,100,200]
for i in range(N):
d = hamming(seq,L)
if i+1 in R:
print str((i+1)*mrate).rjust(4) + '% ',
print str(d).ljust(6),
print report(L)
return Y

N = 250
seq = init()
print report(seq)
X = np.arange(1,N+1)
colors = ['red','dodgerblue','purple']
drawing = True

for j in range(3):
Y = simulate(seq,N)
if drawing:
# plot at offset to see more clearly

if not drawing: sys.exit()
# dotted lines
line = plt.plot((0,225),(0.75,0.75),color='k',ls=':')
line = plt.plot((10,10),(0,0.8),color='k',ls=':')
line = plt.plot((50,50),(0,0.8),color='k',ls=':')
line = plt.plot((100,100),(0,0.8),color='k',ls=':')
ax = plt.axes()
ax.set_xlim(-2, 265)
ax.set_ylim(0, 0.8)

Plotting the normal distribution with Python

It is nice to be able to add a plot of the normal distribution on top of another plot, say a histogram of your data. I've done it before from R (here) using code like this (which assumes we have some data in an array M):

plot(function(x) dnorm(x,0,sd(M)),

I wanted to find out how to do this using numpy and matplotlib. It turns out that while R has these functions built-in, numpy doesn't seem to have them. On this page showing an example exactly like what I want to do, the SciPy docs implement the pdf for the normal distribution as if it had come straight out of wikipedia. So, we'll do the same!

Here is a first shot at it, just the normal pdf, without any sample data. If you want "curves", (or what look like curves but are actually very closely spaced "points"), just make the variable dx smaller. The plot code is similar to what we did the other day, with the Poisson distribution.

import math
import numpy as np
import matplotlib.pyplot as plt

def normal(mu,sigma):
def f(x):
z = 1.0*(x-mu)/sigma
e = math.e**(-0.5*z**2)
C = math.sqrt(2*math.pi)*sigma
return 1.0*e/C
return f

X = 2
dx = 0.1
R = np.arange(-X,X+dx,dx)

L = list()
sdL = (0.5,1,2,3)
for sd in sdL:
f = normal(mu=0,sigma=sd)
L.append([f(x) for x in R])
colors = ['r','b','purple']

for c,P in zip(colors,L):

ax = plt.axes()

Monday, February 22, 2010

Replot the Poisson example with Python

I realized that I used R for the plot of the Poisson distribution from the other day, and I decided to redo it with matplotlib.

This might be a first, since I finally figured out how to set the axis limits. I still have had another problem, I would like to plot the lines after the points, but matplotlib does it how it wants, I guess. It doesn't matter which order the plot and scatter calls are in, the points get plotted last. Maybe I'll ask on SO.

[Those guys know everything: the answer is, you just set the zorder. Here is a link to the example code.]

import math
import matplotlib.pyplot as plt

def poisson(m):
def f(k):
e = math.e**(-m)
f = math.factorial(k)
g = m**k
return g*e/f
return f

R = range(20)
L = list()
means = (1,2,4,10)
for m in means:
f = poisson(m)
L.append([f(k) for k in R])
colors = ['r','g','b','purple']

for c,P in zip(colors,L):

ax = plt.axes()

Friday, February 19, 2010

Checking an exponential approximation

In the previous post about the Poisson approximation to the binomial distribution, we said that "since n is large and p is small":
(1-p)n ≈ e-np

I wanted to take a look at the accuracy of the approximation. In the plot at the top, p varies from 0.005 to 0.03 as shown on the x-axis, for values of n in the series 10,30,100,300 (red, blue, purple, gray). The first thing to observe is that, if p is small, the approximation is very good. The error is < 1% for all values of p < 0.25 if n is equal to 10 or 30. However, it is not necessary that n be large. In fact, the error is much worse for n = 100 or 300. Still, for 100 trials with p = 0.01, the error is < 1%.

R code:

f <- function(p,n) {
left = (1-p)**n
right = exp(-p*n)
(right - left)/left }

p = seq(0.005,0.04,by=0.005)

Thursday, February 18, 2010

Poisson approximation to the binomial

I came across a nice introduction to probability that starts from sets (here). It contains a small section extending the binomial distribution to the Poisson, which explains the simplifying assumptions in an intuitive way. I'm going to do something similar here, except that I'm following the derivation in my favorite bioinformatics textbook, Higgs & Attwood.

This will complement a previous post, which explored the approximation of the binomial as the normal distribution.

Say we're carrying out a series of Bernoulli trials, flipping a (possibly unfair) coin, or just throwing a loaded die where the possible results are divided into two classes: success and failure. If the probability of success is the same on any trial, denoted p (and that of failure is q = 1-p), then the probability of a particular sequence like SFSSSFSFSS containing k successes and n-k failures in a total of n trials is

We can gather all the like terms together to give
pk qn-k = pk (1-p)n-k

We can obtain the total probability for all trials having the same number of successes. Multiply the previous result by the number of combinations, the number of ways of picking k successes out of n trials, which I will symbolize as nCk for convenience. This is called "n choose k."
nCk = n! / (n-k)! k!

The complete expression for the binomial distribution is:
P(k) = [n! / (n-k)! k!] pk (1-p)n-k

The Poisson approximation applies to cases where p is small and n is large enough that p/n, the mean number of successes, stays reasonable (like 1 or 2), even as we make n large and p shrinks closer and closer to zero. We will (eventually, below) use the symbol λ for that ratio.

But start by looking at the term due to combinations. By our assumptions n is large and p is small, so that k is reasonable for the cases we are interested in (near the mean and smaller). Suppose k = 2. Then:

n!/(n-k)! = n * n-1 * n-2 * n-3...  ≈ n2
n-2 * n-3...

So, generalizing to k, and also remembering the factor of 1/k! from the original expression, the complete term for the combinations is approximately
nCk ≈ nk / k!

The second simplification comes from the probability term for failures. Separating the exponents:

(1-p)(n-k) = (1-p)n * (1-p)-k

Since n is large and p is small:
(1-p)n ≈ e-np
(1-p)-k ≈ 1

(Note: the first approximation will need some justification, that I'll defer to a later post. Also, it's not clear to me why we can ignore p in the second case but not in the first).

From these two approximations we obtain:
P(k) = nk / k! * pk * e-np

Substitute λ = np to reveal the familiar Poisson distribution:

P(k) = λk / k! * e

One interesting property of the Poisson distribution is that the variance is equal to the mean, so that the curves spread out as λ increases.

R code:

poisson <- function(lambda) {
p <- function(k) {
e = exp(-lambda)
f = factorial(k)
g = lambda**k
g*e/f }

x = 0:20
y = poisson(1)(x)
y = poisson(4)(x)
y = poisson(10)(x)

Thursday, February 11, 2010

Unifrac (8): understanding PCoA---calculate

I'm exploring Unifrac (first and second posts).

In the previous post, I showed code and some output related to our topic for this post. You'll need to generate those numbers to follow along. Of course, you could just paste 'em in:

[[-20 -16 -12  -8  -4   0   4   8  12  16]
[ -2 1 2 2 -1 1 -1 -3 -2 5]]

The whole trick to PCoA is that it can take a "measure of association" between two samples, like a distance. So, let's use the values from before, calculate the distances between them, and then see how well PCoA recovers the original data. The second half of the code listing is at the bottom of the post. In the first segment of this second half, we simply define two functions to calculate the distances between all the points, using either the "manhattan" distance or the "euclidean" distance. We will only use the euclidean version for this analysis. If you want to see the results of the distance calculation, you'll have to run the code (manhattan) or modify it to print the euclidean result.

Next, we want to do the PCoA analysis. I did this two ways (to check myself). First, I looked under the hood at the PyCogent module cogent.cluster.metric_scaling, and I call the same functions used in metric_scaling.principal_coordinates_analysis. These include the following two functions as well as a numpy.linalg routine:

E = metric_scaling.make_E_matrix(eucl)
F = metric_scaling.make_F_matrix(E)
eigvals, eigvecs = eigh(F)

followed by a little massaging of the vectors to generate the output. As you will see if you run the code, the eigenvalues and then the last two rows of my result are:

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7.3 36.34
[ 1.9 -1. -2. -1.9 1.2 -0.8 1.3 3.4 2.4 -4.5]
[-18. -14. -10. -6. -2. 2. 6. 9.9 14. 18.1]

The other approach is to call PyCogent's function directly. The eigenvalues and the last two results of the call to

pca = metric_scaling.principal_coordinates_analysis
point_matrix, eigvals = pca(eucl)


-0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 53.28 1320.32
[ 1.9 -1. -2. -1.9 1.2 -0.8 1.3 3.4 2.4 -4.5]
[-18. -14. -10. -6. -2. 2. 6. 9.9 14. 18.1]

This looks pretty good, though I notice that my eigenvalues aren't matching. I'm missing a square root somewhere. I will postpone worrying about this for another day.

>>> 1320.32/53.28
>>> (36.34/7.3)**2

The original matrix is:

[[-20 -16 -12  -8  -4   0   4   8  12  16]
[ -2 1 2 2 -1 1 -1 -3 -2 5]]

I think this is pretty remarkable. We've taken a matrix of distances between our sample points, done a little PCoA mumbo-jumbo, and regenerated something pretty close to the original.

However, I am a bit worried at what look to be bugs in the implementation:

• the original x-values are systematically transformed by + 2
• the original y-values are systematically transformed by * -1

This gives a shift and a reflection of the points, but the relationships revealed by the PCoA are identical to the original.

[UPDATE: <whacks self on head> Duh...I bet that in view of the last line above, if we calculate the distance we'll get the same result as we did for the original matrix. So, of course, there's no reason that PCoA could possibly tell the difference. Presumably, it's just centered its result a little better around (0,0). ]


def delta(n1,n2): return abs(n2-n1)

def manhattan_dist(x1,y1,x2,y2):
return delta(x1,x2) + delta(y1,y2)

def euclidean_dist(x1,y1,x2,y2):
sumsq = delta(x1,x2)**2 + delta(y1,y2)**2
return np.sqrt(sumsq)

eucl = list()
manh = list()
for i in range(len(x)):
for j in range(len(x)):
x1,y1 = A[0][i], A[1][i]
x2,y2 = A[0][j], A[1][j]

manh = np.array(manh)
manh.shape = (N,N)
eucl = np.array(eucl)
eucl.shape = (N,N)
print 'manhattan\n', manh, '\n'
# PyCogent stuff starts here:

# I call their functions
E = metric_scaling.make_E_matrix(eucl)
F = metric_scaling.make_F_matrix(E)
eigvals, eigvecs = eigh(F)
t_eigvecs = eigvecs.transpose()

print 'eigvals'
for e in eigvals: print round(e,2),
print 't_eigvecs'
for row in t_eigvecs:
print np.round(row,decimals=1)

eigval_root = np.sqrt(abs(eigvals))
print 'eigval_root\n'
for e in eigval_root: print round(e,2),

final = t_eigvecs * eigval_root[:,np.newaxis]
print 'my calculation'
for row in final:
print np.round(row,decimals=1)

# They call their functions
print 'PyCogent'
pca = metric_scaling.principal_coordinates_analysis
point_matrix, eigvals = pca(eucl)
print 'eigvals'
for e in eigvals: print round(e,2),
print 'point_matrix'
for row in point_matrix:
print np.round(row,decimals=1)
print 'original matrix (A)\n', A

Unifrac (7): understanding PCoA---setup

I'm exploring Unifrac (first and second posts).

One of the nice aspects of UniFrac is the "multivariate analysis" it can do including PCoA (Principal Coordinates Analysis). I'm trying to understand how this works, and I have two posts to go through it. This first post is similar to stuff we've done before (e.g. here), but I want to try to present a coherent picture. Sorry for the repetition. The code is presented in two parts overall, and then today's section has two parts also. We begin by getting a bunch of integers in x-coordinates over the range from -20 to +20, and in y-coordinates randomly between -5 and +5. We turn those into a 2D array and plot them using matplotlib as red circles.

[[-20 -16 -12 -8 -4 0 4 8 12 16]
[ -2 1 2 2 -1 1 -1 -3 -2 5]]

Next, the array is rotated through 45 degrees and replotted (blue circles).

In the second part (of the first half), we use standard Principal Component Analysis (PCA) on these data points to recover their long dimension (now rotated) back to the x-axis. We calculate the covariance matrix for these two sets of points, and then do np.linalg.eigh(m) to get the eigenvalues and eigenvectors.

[[ 74.08888889 70.35555556]
[ 70.35555556 78.53333333]]
[ 5.9204692 146.70175303]
[[-0.71818168 0.69585564]
[ 0.69585564 0.71818168]]

The eigenvalues come out with the largest one last. We grab the first eigenvector (the last in the array we received), and do,B)

which gives back the x-values we started with.

[-20. -16. -12.  -8.  -4.   0.   4.   8.  12.  16.]

We can do the same with the second eigenvector to get the y-values. We can combine these two operations by re-ordering the eigenvectors:

print 'C'
T2 = np.array([ev1,ev2])
C =,B)
for row in C:
print np.round(row)

[-20. -16. -12. -8. -4. 0. 4. 8. 12. 16.]
[-2. 1. 2. 2. -1. 1. -1. -3. -2. 5.]

In the figure, the C array has been plotted as black cross-bars, which match the original red circles pretty well. Next time, we'll take the same data points and look at them from the perspective of PCoA. We'll also see how PyCogent fares with our data, and how reasonable a picture it can estimate of what we started with.

Here's the code:

import numpy as np
from pylab import *
from cogent.cluster import metric_scaling
plotting = len(sys.argv) > 1

x = range(-20,20,4)
y = np.random.random_integers(-5,5,size=len(x))
A = np.array([x,y])
N = len(x)
print 'A\n', A

n = 1.0/np.sqrt(2)
T = np.array([[n,-n],[n,n]])
B =,A)

if plotting:


m = np.cov(B)
print 'cov\n', m
e = np.linalg.eigh(m)
print 'eigenvalues\n', e[0]
print 'eigenvectors\n', e[1]

ev1 = e[1][1]
print np.round(,B))
ev2 = e[1][0]
print np.round(,B))

print 'C'
T2 = np.array([ev1,ev2])
C =,B)
for row in C:
print np.round(row)

if plotting:

R = (-25,25)

Wednesday, February 10, 2010

Unifrac (6): understanding how PyCogent does PCoA

I'm exploring Unifrac (first and second posts).

Continuing with the example from last time, I am trying to understand how the PCoA analysis works in UniFrac. I have made some progress, with a lot of help from Justin Kuczynski. The theoretical basis for what we're going to do is given in a book called "Principles of Multivariate Analysis" by A. Krzanowski (Amazon, Google Books, p 107), (among other sources).

I don't understand it (yet).

But the basic approach is that we're going to do something similar to what is done in Principal Component Analysis (PCA, see here, for example), however, the mathematical treatment is different. In PCA we have data, values for a number of measurements (x,y,z, up to n dimensions) for each of a number of samples or entities. PCA will shows projections of the data onto two-dimensions (planes), where the planes are chosen so as to maximize the spread of the overall variance in that plane. We're reducing multi-dimensional data to fewer dimensions, while preserving as much of the structure as we can.

And in PCA, again, we make a covariation matrix, and then get the eigenvalues and eigenvectors for that matrix.

In the UniFrac flavor of Principal Coordinates Analysis (PCoA) we have a distance measurement on a number of samples (based of course, on the phylogenetic distance as measured by the UniFrac metric). What Krzanowski shows is how to do something similar to PCA but starting with the distance data, or indeed, any "measure of association."

Without worrying too much further about the individual steps, here they are. First, we recreate the distance matrix from our example:

import numpy as np
from cogent.cluster import metric_scaling
from numpy.linalg import eigh

AB = 0.69276243
AC = 0.6549957
BC = 0.60625465
M = np.array( [0,AB,AC,AB,0,BC,AC,BC,0] )
M.shape = (3,3)
print 'M\n', M, '\n'

[[ 0. 0.69276243 0.6549957 ]
[ 0.69276243 0. 0.60625465]
[ 0.6549957 0.60625465 0. ]]

Now, we have two simple arithmetic operations on the data:

# step 1
E = metric_scaling.make_E_matrix(M)
print 'E\n', E
print 'M**2/-2\n', M**2/-2, '\n'

[[-0. -0.23995989 -0.21450968]
[-0.23995989 -0. -0.18377235]
[-0.21450968 -0.18377235 -0. ]]
[[-0. -0.23995989 -0.21450968]
[-0.23995989 -0. -0.18377235]
[-0.21450968 -0.18377235 -0. ]]

# step 2
F = metric_scaling.make_F_matrix(E)
print 'F\n', F

row_means = np.mean(E,axis=0)
col_means = np.mean(E,axis=1)
overall_mean = np.mean(E)
F2 = E - row_means - col_means + overall_mean
print 'F2\n', F2, '\n\n'

[[ 0.16114818 -0.08905749 -0.07209069]
[-0.08905749 0.14065662 -0.05159913]
[-0.07209069 -0.05159913 0.12368982]]
[[ 0.16114818 -0.08905749 -0.07209069]
[-0.08905749 0.14065662 -0.05159913]
[-0.07209069 -0.05159913 0.12368982]]

# step 3
eigvals, eigvecs = eigh(F)
print 'eigvals\n', eigvals
print 'eigvecs\n', eigvecs

eigval_root = np.sqrt(abs(eigvals))
print 'eigval_root\n', eigval_root

t_eigvecs = eigvecs.transpose()
print 't_eigvecs\n', t_eigvecs
print t_eigvecs * eigval_root[:,np.newaxis]

As you can see, the eigenvalues do not come out in sorted order. I have copied the code that UniFrac uses to massage the vectors from this point, but I'm not sure I understand it yet---and not sure there isn't a simpler way :).

[ 5.20417043e-18 1.80259579e-01 2.45235038e-01]
[[-0.57735027 0.18984739 0.79411878]
[-0.57735027 0.59280334 -0.56147205]
[-0.57735027 -0.78265073 -0.23264672]]

[ 2.28126509e-09 4.24569875e-01 4.95212115e-01]

[[-0.57735027 -0.57735027 -0.57735027]
[ 0.18984739 0.59280334 -0.78265073]
[ 0.79411878 -0.56147205 -0.23264672]]

[[ -1.31708902e-09 -1.31708902e-09 -1.31708902e-09]
[ 8.06034836e-02 2.51686440e-01 -3.32289924e-01]
[ 3.93257240e-01 -2.78047763e-01 -1.15209477e-01]]

The final result bears some resemblance to what fast_unifrac gave us (see last time).

Type Label vec_num-0 vec_num-1 vec_num-2
Eigenvectors A 0.39 0.08 -0.00
Eigenvectors B -0.28 0.25 -0.00
Eigenvectors C -0.12 -0.33 -0.00
Eigenvalues eigenvalues 0.25 0.18 0.00
Eigenvalues var explained (%) 57.64 42.36 0.00

Can you see that in the final output from the script, we have the x-values for each sample in order in the third row of the vector, and the y-values in the second row? The 3rd dimension is in the first row.

While there are still plenty of mysteries, I believe we have at least a handle on what is going on here. That's enough for now.

Unifrac (5): results using PyCogent

Last time, I used a set of simulated sequences and their phylogenetic tree as input to the web interface of Unifrac (the original version). This time, I propose to repeat the analysis on my laptop using PyCogent. (See the PyCogent category in the sidebar for more from me on this, or start with this post). I'm just going to do the analysis here. Next time I want to look the distinction between PCA and PCoA.

Provided that we've done our imports, and loaded both the tree and the environment, the actual analysis runs in a single line. The results are shown first, followed by the code. The commented lines in the code were used in the first implementation based on the usage example, but aren't needed here.

There are a couple of things to notice. As for the rest of PyCogent, objects bearing results act like dicts, and we recover the elements of interest with result['pcoa'] (for example). Second, the distance matrix (a table of UniFrac metrics for pairwise comparisons between each of the three samples) is slightly different than what we got from the web interface. The comparison of A with C gives exactly the same result, but the other two have changed. Therefore, it is not surprising that the PCoA results are different. It is however, a bit of a shock to see that the small change in the ED (Environmental Distance) matrix has rotated the plot through about 170°.

[UPDATE: I forgot to mention that I had altered PyCogent slightly to always do PCoA. I think what you will need to do with this code is to pass an argument to fast_unifrac of modes = ['distance_matrix','pcoa'], but I haven't tested it yet.]

(array([[ 0.        ,  0.69276243,  0.6549957 ],
[ 0.69276243, 0. , 0.60625465],
[ 0.6549957 , 0.60625465, 0. ]]), ['A', 'B', 'C'])
Type Label vec_num-0 vec_num-1 vec_num-2
Eigenvectors A 0.39 0.08 -0.00
Eigenvectors B -0.28 0.25 -0.00
Eigenvectors C -0.12 -0.33 -0.00
Eigenvalues eigenvalues 0.25 0.18 0.00
Eigenvalues var explained (%) 57.64 42.36 0.00

from cogent import LoadTree
#from cogent.parse.tree import DndParser
#from cogent.maths.unifrac.fast_tree import UniFracTreeNode
from cogent.maths.unifrac.fast_unifrac import fast_unifrac

# convert our environment file repr back into a dict
FH = open('environ.txt','r')
data =
L = data.split('\n')
env_dict = dict()
for e in L:
seq_name, sample = e.strip().split()
env_dict[seq_name] = {sample:1}

tr = LoadTree('samples.tree')
#tree_str = tr.getNewick(with_distances=True)
#tr = DndParser(tree_str, UniFracTreeNode)
result = fast_unifrac(tr, env_dict)
print result['distance_matrix']
print result['pcoa']

R code:
x = c(0.39,-0.28,-0.12)
y = c(0.08,0.25,-0.33)

Unifrac (4): Web results

I'm exploring UniFrac in a series of posts (first here). Last time, we produced a set of simulated sequences from three samples, or environments. I ended up fiddling with that a bit more. I changed this line in my code to drop two of the A sequences out (A11 and A12 in the original example). I did this because the initial set did not show any significant difference between samples.

A = {0:5,1:5,2:0,3:0,4:0,5:0,6:1,7:1}

(Because the sequences are numbered in order, in the new set the labels A11 and A12 have been reassigned to the previous A13 and A14). The sequences are organized to make a phylogenetic tree which was written to disk as 'samples.tree'. I also wrote a simple script to process the titles of the sequences and save them in a file named 'environ.txt'. The first line is:

A1 A

Now let's use these two files as input to UniFrac (web interface). We compute the UniFrac metric or distance for each pair of samples. As you can see, we're going to do 1000 randomizations (this requires simple registration with the site):

and evaluate the signficance of the results by repeated randomization of the labels. The raw p-values

need to be corrected by multiplying by the total number of comparisons (Bonferroni correction).

Notice that only A and B are significantly different, although it was a near-run thing with A and C. We download the actual UniFrac values for comparison with what PyCogent gives us. The text version of the file looks like this:

. A B C
A 0 0.666882546144 0.654995703131
B 0.666882546144 0 0.576182797703
C 0.654995703131 0.576182797703 0

We also ask for a PCA (Principal Coordinates Analysis). Here is the plot of the two principal coordinates:

And here are the eigenvalues and eigenvectors in a data file:

pc vector number  1 2 3 
A -0.396005816461 -0.0214251320143 -3.04168679166e-09
B 0.22015648746 -0.276525789753 -3.04168679166e-09
C 0.175849329001 0.297950921767 -3.04168679166e-09

eigvals 0.236212472152 0.165700300462 2.77555756156e-17
percent variation explained 58.7720740039 41.2279259961 6.90587050397e-15

Of course, we can always replot using R:

> x = c(-0.396,0.220,0.176)
> y = c(-0.021,-0.277,0.298)
> plot(x,y,pch=c('A','B','C'),cex=2)

You can make this as fancy as you like.

Tuesday, February 9, 2010

Duly quoted

Randall Munroe (of xkcd) in a Google talk:

If Python is "executable pseudocode",
then Perl is "executable line noise."

Now be nice...

Area plots and filling in with matplotlib

A question on StackOverflow (here) prompted me to look into the function fill_between in matplotlib. One good use of this would be when we have some factor (say bacterial Phyla) that vary in proportions in different samples (say, sequences collected from different sites along the colon). I believe that R calls these "area plots." Another use we've seen before, to fill in the area below a curve.

[UPDATE: I forgot to normalize the values. This only came out right because the sum is < 1 for all y1, y2 values. You know what to do.]


import numpy as np
from pylab import *

N = 5
x = np.arange(0,N)
y1 = np.random.rand(N)
y2 = y1 + np.random.rand(N)
y3 = np.array([1]*N)


x = np.arange(0,N,0.01)
y4 = np.sqrt(x)

Monday, February 8, 2010

Unifrac (3): Simulating sequences

As part of my exploration of Unifrac (first and second posts), I'm going to need some sequences.

I posted about this before, then pulled that post. In fact, I pushed out this very post yesterday, and am now editing heavily. I'm finding a conflict between what's needed to make a good example for showing how the calculations work, and what's needed for a good example in terms of the final result. I've decided to keep this here, but be aware that we may not use these sequences going forward.

This script fetches a bunch of prototype sequences from Genbank. Then it distributes them to three "samples" according to a hard-coded distribution. In the process, the sequences are mutagenized a bit. Finally, the sequences are aligned, rooted (by hand) and then, as shown above, a plot is made in R of the phylogenetic tree.

[UPDATE: I modified the code slightly to give a bit more heterogeneity to the sequences. The code is updated, and here is a plot of the new phylogenetic tree.

Also, I fixed an awkward point about the old code: I have made sure that the tree we use is a real, rooted tree by incorporating Thermotoga as an outgroup.]

Here's the code (R code at the end):

import random, os, sys
from cogent import LoadSeqs, DNA, LoadTree
from cogent.db.ncbi import EFetch
from import align_unaligned_seqs
from import build_tree_from_alignment

def fetch_ncbi_data(ofile,s):
# get the seqs from Genbank
input = [e.split() for e in s.strip().split('\n')]
id_list = [t[0] for t in input]
names = [t[1] for t in input]
ef = EFetch(id=','.join(id_list), rettype='fasta')
data =

# title lines are too long, replace by genus_species
rL = list()
for i,e in enumerate(data.split('\n\n')):
old_title, seq = e.strip().split('\n',1)
new_title = '>' + names[i]
seq = seq[:500]
FH = open(ofile,'w')

def mutagenize(seq, mrate=1):
L = list(seq)
D = { 'A':'CGT', 'C':'AGT', 'G':'ACT', 'T':'ACG' }
N = int(mrate / 100.0 * len(seq))
X = len(seq)
for i in range(N):
j = random.choice(range(X))
nt = L[j]
if not nt in 'ACGT': continue
L[j] = random.choice(D[nt])
return ''.join(L)

def distribute_seqs(ifile,ofile):
# set up our samples
FH = open(ifile,'r')
data ='\n\n')
seqs = list()
for e in data:
title,seq = e.split('\n',1)

outgroup = '>Thermotoga\n' + seqs.pop()

A = {0:5,1:5,2:0,3:1,4:0,5:1,6:1,7:1} # A has lots of Firmicutes
B = {0:0,1:1,2:5,3:5,4:1,5:0,6:1,7:1} # B has Bacteroidetes
C = {0:1,1:0,2:1,3:0,4:5,5:5,6:1,7:1} # C has enterics
dL = [A,B,C]
L = list()

for distr, sample in zip(dL,list('ABC')):
counter = 1
for k in distr:
seq = seqs[k]
n = distr[k]
for i in range(n):
if n == 1: mrate = 5
else: mrate = random.choice((1,2,3))
copy = mutagenize(seq[:],mrate)
name = sample + str(counter)
counter += 1
FH = open(ofile,'w')
L = [seq.toFasta() for seq in L]

def align_seqs(ifile,ofile):
seqs = LoadSeqs(ifile, moltype=DNA, aligned=False)
aln = align_unaligned_seqs(seqs, DNA)
return aln

def get_tree(ifile):
aln = LoadSeqs(ifile, moltype=DNA, aligned=True)
tr = build_tree_from_alignment(aln,moltype=DNA)
return tr

s = '''
AY005045.1 Streptococcus_mitis_bv2
D83363.1 Staphylococcus_epidermidis_14990
L14639.1 Capnocytophaga_gingivalis
AB053940.1 Tannerella_forsythensis_HA3
EU009197.1 Shigella_sonnei_FBD023
AB435616.1 Serratia_marcescens_JCM24201
AB302401.1 Pseudomonas_cinnamophila
AF411020.1 Achromobacter_xylosoxidans_AU1011
AJ401017.1 Thermotoga_maritima_SL7

fn1 = 'rRNA_gb.fasta'
fn2 = 'samples.fasta'
fn3 = 'samples.aln.fasta'
fn4 = 'samples.tree'

if not os.path.exists(fn1) or False:
if not os.path.exists(fn2) or False:

if not os.path.exists(fn3) or True:
aln = align_seqs(fn2,fn3)
tr = get_tree(fn3)
# re-root manually
print tr.asciiArt()
n = tr.getNodeMatchingName('Thermotoga')
for a in n.ancestors(): print a.Name

tr2 = tr.rootedAt(n.ancestors()[0].Name)
tree_str = tr2.getNewick(with_distances=True)
FH = open(fn4,'w')
FH.write(tree_str + '\n')

tr = LoadTree(fn4)
print tr.asciiArt()

R code:
tr = read.tree('temp/samples.tree')
colors = rep('black',length(tr$tip.label))
o = grep('A',tr$tip.label)
colors[o] = 'red'
o = grep('B',tr$tip.label)
colors[o] = 'blue'
o = grep('C',tr$tip.label)
colors[o] = 'darkgreen'

Unifrac analysis (2): P-test

I'm planning to discuss UniFrac in a series of posts (first here). But before moving further into it, I would like to define another statistic called the Phylogenetic-test or P-test. This test is available in UniFrac, but it was introduced earlier by Andrew Martin (Martin 2002 PMID 12147459).

Consider the same phylogenetic tree as in the first post. Here, the environment (sample) labels for a few of the sequences have been exchanged to provide a bit more heterogeneity in the tree.

The criterion of parsimony is employed to characterize the tree with respect to the labels, by computing the minimum number of "changes" (switches from one community to another), to explain the observed distribution. For example, we can imagine (in some theoretical sense) to have started with a blue label, and switched to red (and then to blue again) as indicated on the left; or alternatively, to have started with red and changed to blue, on the right. Here, the minimum number of changes needed to "explain" the tree is four.

The significance of the result is evaluated by label randomization (as for UniFrac itself), and the result is referred to as a P-test. If the randomization tests yield a statistic as extreme as that for the actual sample less than 5% of the time, then the distribution is considered to be significant. For a sample where one environment is distinct from another, fewer label switches are required. With completely separate distributions, the number of changes needed is one.