Consider two short DNA sequences, both of length 26. We observe this when we align them:

Comparing positions with the same index, there is a total of 11 matches. What is the signficance of this similarity?

We formulate the null hypothesis: the sequences are random. Let's say that they come from a source where the nucleotides are at equal frequencies (A=C=G=T=0.25). We can model this as a sequence of Bernoulli trials with:

We calculate the probability that a result as extreme as this, or even more extreme, will be observed.

Here is Python code to do that. We pre-calculate and cache the values of the factorial function.

And here is the output:

As expected, the peak of the probability distribution is near 6-7 (= np = 6.5). (The probability for k > 15 is not zero, it is just very small and lost in the roundoff error).

We can also approximate this distribution by the

*normal*(np,npq), where np is the mean and npq is the variance.

The z-score for k = 11 is (11 - 6.5) / 2.21 = 2.03 which is a bit larger than that for a total cumulative probability of 95% (1.96). Hence we reject the null hypothesis.

And, of course, we can simulate the DNA sequences in Python. Here is a function to generate a random DNA sequence with arbitrary GC content and length n. (We make sequences in multiples of 100 to minimize roundoff error from adjusting for GC):

We do 10 reps of 1000 trials:

And we obtain:

Again, about 4% of the time we obtain a result this extreme (or even more extreme).

## No comments:

Post a Comment