## Monday, July 27, 2009

### Statistical Methods in Bioinformatics

I'm starting a new book with the same title as this post. It looks a little daunting in places, but we'll see how it goes. The first example is very simple.

Consider two short DNA sequences, both of length 26. We observe this when we align them:

```
ACGTACGTACGTACGTACGTACGTAC
TCCATCGATGCAAGGTTGGATCGAAC
 *   **     * **  *  ** **
```

Comparing positions with the same index, there are 11 matches. What is the significance of this similarity?

We formulate the null hypothesis: the sequences are random. Let's say that they come from a source where the nucleotides are at equal frequencies (A=C=G=T=0.25). We can model this as a sequence of Bernoulli trials with:

```python
p = 0.25   # probability of success (a match)
n = 26     # number of trials
q = 1 - p  # probability of failure (0.75)
```

We calculate the probability that a result as extreme as this, or even more extreme, would be observed. That is P(X ≥ 11), which is one minus the probability of ten or fewer matches:

```
P(X >= 11) = 1 - Σ (k = 0 to 10) C(26,k) p^k q^(n-k)
```
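The same upper tail can be computed directly with the standard library's `math.comb` (Python 3.8+); a minimal cross-check:

```python
from math import comb

# exact upper-tail probability P(X >= 11) for X ~ Binomial(26, 0.25)
n, p = 26, 0.25
pval = sum(comb(n, k) * p**k * (1-p)**(n-k) for k in range(11, n+1))
print(round(pval, 4))   # → 0.0401
```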

Here is Python code to do that. We pre-calculate and cache the values of the factorial function.

```python
n = 26
fL = [1]                 # fL[i] holds i!
for i in range(1, n+1):
    fL.append(fL[-1]*i)

p = 0.25
q = 1 - p
Ptot = 0
for k in range(17):
    C = fL[n] // (fL[k]*fL[n-k])   # binomial coefficient C(n,k)
    P = C * p**k * q**(n-k)
    Ptot += P
    pL = [str(e) for e in [k, n-k, n]]
    print(' '.join(pL).rjust(10),
          str(round(P, 4)).ljust(6),
          str(round(Ptot, 4)).ljust(6))
```

And here is the output:

```
   0 26 26 0.0006 0.0006
   1 25 26 0.0049 0.0055
   2 24 26 0.0204 0.0258
   3 23 26 0.0544 0.0802
   4 22 26 0.1042 0.1844
   5 21 26 0.1528 0.3371
   6 20 26 0.1782 0.5154
   7 19 26 0.1698 0.6852
   8 18 26 0.1344 0.8195
   9 17 26 0.0896 0.9091
  10 16 26 0.0508 0.9599
  11 15 26 0.0246 0.9845
  12 14 26 0.0103 0.9948
  13 13 26 0.0037 0.9985
  14 12 26 0.0011 0.9996
  15 11 26 0.0003 0.9999
  16 10 26 0.0001 1.0
```

As expected, the peak of the probability distribution is near 6-7 (np = 6.5). (The probability for k > 16 is not zero; it is just very small and lost in the roundoff error.)
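The location of that peak can be confirmed from the binomial probability mass function; a quick sketch using `math.comb` (Python 3.8+):

```python
from math import comb

# pmf of Binomial(26, 0.25); the peak should sit near n*p = 6.5
n, p = 26, 0.25
pmf = [comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1)]
mode = max(range(n+1), key=lambda k: pmf[k])
print(mode)   # → 6, the most probable number of matches
```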

We can also approximate this distribution by a normal distribution with mean np and variance npq:

```
m   = n*p   = 26 * 0.25  = 6.5
var = n*p*q = 6.5 * 0.75 = 4.875
sd  = sqrt(var)          = 2.21
```

The z-score for k = 11 is (11 - 6.5) / 2.21 = 2.04, which is a bit larger than 1.96, the value that bounds the central 95% of the standard normal. Hence we reject the null hypothesis at the 5% level.
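The tail area of the normal approximation can be computed with only the standard library's `erf`; a minimal sketch (the continuity correction, using 10.5 rather than 11, is a standard refinement not used above):

```python
from math import sqrt, erf

m = 26 * 0.25                  # mean np = 6.5
sd = sqrt(26 * 0.25 * 0.75)    # sqrt(npq), about 2.21

def upper_tail(z):
    # P(Z >= z) for a standard normal, via the error function
    return 0.5 * (1 - erf(z / sqrt(2)))

z = (11 - m) / sd              # about 2.04, beyond the 1.96 cutoff
print(round(z, 2), round(upper_tail(z), 4))

# a continuity correction (use 10.5, not 11) brings the normal
# approximation closer to the exact binomial tail of about 0.040
zc = (10.5 - m) / sd
print(round(zc, 2), round(upper_tail(zc), 4))
```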

And, of course, we can simulate the DNA sequences in Python. Here is a function to generate a random DNA sequence of length N with arbitrary GC content. (It builds a pool of bases whose size is a multiple of 100, to minimize roundoff error when converting the GC fraction to whole-number base counts, and returns the first N bases):

```python
import random

def randomDNA(N, GC=0.50):
    # pool size M: round up to a multiple of 100
    M = 100 * (N//100 + 1)
    g = int(M * GC/2.0)        # count of G (and of C)
    a = int(M * (1-GC)/2.0)    # count of A (and of T)
    L = list('G'*g + 'A'*a + 'C'*g + 'T'*a)
    random.shuffle(L)
    return ''.join(L[:N])
```

We do 10 reps of 1000 trials:

```python
def matches(u, v):
    # count positions at which the two sequences agree
    return sum(1 for n1, n2 in zip(u, v) if n1 == n2)

reps = 10
N = 1000
Length = 26
Matches = 11

print(N, 'trials:')
rL = list()
for i in range(reps):
    S = 0
    for j in range(N):
        dna1 = randomDNA(Length)
        dna2 = randomDNA(Length)
        if matches(dna1, dna2) >= Matches:
            S += 1
    f = S/N
    print(f)
    rL.append(f)
print('avg =', sum(rL)/len(rL))
```

And we obtain:

```
1000 trials:
0.042
0.042
0.037
...
0.043
0.045
0.041
avg = 0.041
```

Again, about 4% of the time we obtain a result this extreme (or even more extreme).