Consider two short DNA sequences, both of length 26. We observe this when we align them:
Comparing positions with the same index, there is a total of 11 matches. What is the signficance of this similarity?
We formulate the null hypothesis: the sequences are random. Let's say that they come from a source where the nucleotides are at equal frequencies (A=C=G=T=0.25). We can model this as a sequence of Bernoulli trials with:
We calculate the probability that a result as extreme as this, or even more extreme, will be observed.
Here is Python code to do that. We pre-calculate and cache the values of the factorial function.
And here is the output:
As expected, the peak of the probability distribution is near 6-7 (= np = 6.5). (The probability for k > 15 is not zero, it is just very small and lost in the roundoff error).
We can also approximate this distribution by the normal(np,npq), where np is the mean and npq is the variance.
The z-score for k = 11 is (11 - 6.5) / 2.21 = 2.03 which is a bit larger than that for a total cumulative probability of 95% (1.96). Hence we reject the null hypothesis.
And, of course, we can simulate the DNA sequences in Python. Here is a function to generate a random DNA sequence with arbitrary GC content and length n. (We make sequences in multiples of 100 to minimize roundoff error from adjusting for GC):
We do 10 reps of 1000 trials:
And we obtain:
Again, about 4% of the time we obtain a result this extreme (or even more extreme).