Saturday, June 20, 2009

Motif Discovery 5

For the scoring function, in my initial approach I used the Hamming distance because it was simple. Here, I implement the scoring function described in Lawrence (PMID 8211139). This screenshot shows equation 1 from the paper.

Consider a sequence alignment generated from: the list of sequences (L), the list of indexes where the current (tentative version of the) motif starts in each sequence (aL), and the index of the sequence to leave out of the calculation that follows (this is the sequence we are sliding). I call this alignment a motif list (mL) in my code. It is a list of strings.

We consider each position in the current version of the motif in turn. The index i refers to position in the motif, the index j refers to the individual symbols allowed (ACGT). c is the count of nt j at position i. We count the occurence of each nucleotide in the collection of sequences at index i.

Then we add pseudocounts (individually b, in total B).

So, for example, with N = 10 total sequences and 9 in the motif, not counting the one sequence we have left out, if we have 7 C's at a given position and add 1 pseudocount for each nt, we get c = 7 + 1 / 10 - 1 + 4 = 7 / 13 = q. By this method we generate a q value for each nucleotide at each position. We should generate 4 p values corresponding to the background probability for each nt based on the observed sequences, not counting the one being left out. (In my version I simply used 0.25 for each).

Now, given a particular nt observed at each position indexed by i in the candidate alignment with the current motif, we calculate (again, from the paper):

where we are summing over W, the width of the motif, and (sort of) over j. What we do is: at each position retrieve the nt observed in the candidate, count the number of times that nt occurs in the current version of the motif, compute what is basically P(count) x log(P(count)) and sum that over all positions. This is the score to be maximized.

With N-1 nt at each position in the motif, we can pre-calculate the score for each count. Also, the list of scores for each possible nt in the sequence along the range of the motif can also be pre-calculated. This makes the program faster.

I find the second summation symbol a bit confusing. The way I understand it (as mentioned above), at each position in the sequence under consideration there is only one particular nucleotide whose score needs to be retrieved and added to the total score for the alignment. It is only when we consider many different sequences that we have many different nucleotides at each position.

Here is the code:

# compute score by Lawrence et al formula
# for each count from 0..N
# assume background probabilities = 0.25
# (note N-1 correction already made)

def initScoreDict(N):
scoreD = dict()
for c in range(N+1):
q = (c+1) * 1.0 / (N+4)
d = log2(q*1.0/0.25)
scoreD[c] = c*d
#print c,q,d,c*d
#for k in scoreD: print k, scoreD[k]
return scoreD
# mL does not have the sequence under consideration
def initMotifScoreDict(mL,sD):
#print '\n'.join(mL)
#print '*********'
N = len(mL)
M = len(mL[0])

# for each position in the alignment
# for each nt at that position we need the score
# LL is a list of [nt at each index j]
LL = list()
for j in range(M):
LL.append([mL[i][j] for i in range(N)])

# for each index j into mL and LL
# for each n in 'ACGT'
# put the score for the given count of nt

msD = dict()
for j in range(M):
D2 = dict()
L = LL[j]
#print j,''.join(L)
for n in nt:
c = L.count(n)
s = sD[c]
#print n, c, s
D2[n] = s
msD[j] = D2
return msD