Monday, July 27, 2009

Statistical Methods in Bioinformatics

I'm starting a new book with the same title as this post. It looks a little daunting in places, but we'll see how it goes. The first example is very simple.

Consider two short DNA sequences, both of length 26. We observe this when we align them:


ACGTACGTACGTACGTACGTACGTAC
TCCATCGATGCAAGGTTGGATCGAAC
* ** * ** * ** **


Comparing positions with the same index, there is a total of 11 matches. What is the signficance of this similarity?

We formulate the null hypothesis: the sequences are random. Let's say that they come from a source where the nucleotides are at equal frequencies (A=C=G=T=0.25). We can model this as a sequence of Bernoulli trials with:

p = 0.25         # probability of success
n = 26 # number of trials
q = 1-p = 0.75 # probability of failure


We calculate the probability that a result as extreme as this, or even more extreme, will be observed.

P = Σ for k = 0 to 10 of C(26,k) pk qn-k


Here is Python code to do that. We pre-calculate and cache the values of the factorial function.

n = 26
fL = [1] # a list of factorials
for i in range(1,n+1):
fL.append(fL[-1]*i)

p = 0.25
q = 1-p

Ptot = 0
for k in range(17):
C = fL[n]/(fL[k]*fL[n-k])
P = C * p**k * q**(n-k)
Ptot += P
pL = [str(e) for e in [k,n-k,n]]
print ' '.join(pL).rjust(10),
print str(round(P,4)).ljust(6),
print str(round(Ptot,4)).ljust(6)


And here is the output:

   0 26 26 0.0006 0.0006
1 25 26 0.0049 0.0055
2 24 26 0.0204 0.0258
3 23 26 0.0544 0.0802
4 22 26 0.1042 0.1844
5 21 26 0.1528 0.3371
6 20 26 0.1782 0.5154
7 19 26 0.1698 0.6852
8 18 26 0.1344 0.8195
9 17 26 0.0896 0.9091
10 16 26 0.0508 0.9599
11 15 26 0.0246 0.9845
12 14 26 0.0103 0.9948
13 13 26 0.0037 0.9985
14 12 26 0.0011 0.9996
15 11 26 0.0003 0.9999
16 10 26 0.0001 1.0


As expected, the peak of the probability distribution is near 6-7 (= np = 6.5). (The probability for k > 15 is not zero, it is just very small and lost in the roundoff error).

We can also approximate this distribution by the normal(np,npq), where np is the mean and npq is the variance.

m   = np  = 26 * 0.25  = 6.5
var = npq = 6.5 * 0.75 = 4.875
sd = 2.21


The z-score for k = 11 is (11 - 6.5) / 2.21 = 2.03 which is a bit larger than that for a total cumulative probability of 95% (1.96). Hence we reject the null hypothesis.

And, of course, we can simulate the DNA sequences in Python. Here is a function to generate a random DNA sequence with arbitrary GC content and length n. (We make sequences in multiples of 100 to minimize roundoff error from adjusting for GC):

def randomDNA(N,GC=0.50):
N = 100 * (N/100 + 1)
g = int(N * GC/2.0)
a = int(N * (1-GC)/2.0)
L = list('G'*g +'A'*a +'C'*g +'T'*a)
random.shuffle(L)
return ''.join(L[:n])


We do 10 reps of 1000 trials:

def matches(u,v):
L = zip(u,v)
L = [1 for n1,n2 in L if n1 == n2]
return sum(L)

reps = 10
N = 1000
print N,'trials:'

for i in range(reps):
Length = 26
Matches = 11
S = 0
rL = list()
for j in range(N):
dna1 = randomDNA(Length)
dna2 = randomDNA(Length)
m = matches(dna1,dna2)
if m >= Matches:
S += 1
f = 1.0*S/N
print f
rL.append(f)
print 'avg =',sum(rL)/len(rL)


And we obtain:

1000 trials:
0.042
0.042
0.037
...
0.043
0.045
0.041
avg = 0.041


Again, about 4% of the time we obtain a result this extreme (or even more extreme).