Python for Bioinformatics: Jukes-Cantor (8) - using the Poisson distribution

Tuesday, March 30, 2010

Jukes-Cantor (8) - using the Poisson distribution

In a previous post, I showed a derivation of the equations (from Higgs & Attwood) describing the probability of change at a sequence position as a function of time in the Jukes-Cantor model.

P_XX(t) = 1/4 + 3/4*e^-4*α*t
P_XY(t) = 1/4 - 1/4*e^-4*α*t

These were obtained by starting with this model

  t            Δt
A => [A,C,G,T] => A

and then deriving and solving the resulting differential equation. This seems to be a pretty standard treatment.

But in Chapter 11 of Felsenstein, I found an alternative approach based on the Poisson distribution, which I think is quite wonderful.

To begin, he defines rates in terms of u (= 3*λ), so the individual rates for changes from X to Y are u/3. He says:

The easiest way to compute this is to slightly fictionalize the model. Instead of having a rate u of change to one of the three other bases, let us imagine that we instead have a rate 4/3*u of change to a base randomly drawn from all four possibilities. This will be exactly the same process, as there then works out to be a probability of u/3 of change to each of the other three bases. We have also added a rate u/3 of change from a base to itself, which does not matter.

Now we use the Poisson distribution.

P(k) = λ^k / k! * e^-λ
P(0) = e^-λ

For a branch with time t, the

probability of no events at all at a site, when the number expected to occur is 4/3*u*t, is the zero term of the Poisson distribution…the branch of time t consists then of a vast number of tiny segments of time dt, each having the small probability of 4/3*u*dt of an event.

The probability of at least one event is 1 minus the zero term:

1 - e^-4/3*u*t )

The probability that it is a particular event (say, A to C) is:

1/4*(1 - e^-4/3*u*t )

The probability of change to any one of the three nucleotides is then:

3/4*(1 - e^-4/3*u*t )