Wednesday, June 17, 2009

Motif Discovery 4

In a series of posts I plan to explore an algorithm published by Lawrence et al for identifying new motifs in DNA sequence, in Science in 1993 (PMID 8211139). It will give me an opportunity to use Python to explore Bioinformatics, as well as learning more about motif discovery and about the Gibbs Sampler, which is a specialized form of Markov Chain Monte Carlo (MCMC). As always, exploration is limited by my simple understanding of the subject, but I post the results on the internet with the hope that someday, someone will find them useful.

Previously, I put up some code for an initial version, and showed some output, and since then I've been exploring its limitations. I hasten to point out that this first version does not yet use the Gibbs Sampler, but it does a decent job of finding motifs when the sequences to be searched are not too long. Another simplification was to use the Hamming distance to compute the score for an alignment. We'll fix that limitation soon.

The code that we have so far is a simple, greedy hill-climbing algorithm. To help visualize the method, suppose that we have three sequences, each with an embedded motif awaiting discovery, where the second and third sequences are fortuitously aligned correctly.

Suppose also that we have previously identified the start points of the motifs in sequences 2 and 3. To identify the motif in the first sequence, we would then slide the it back and forth trying to align the nucleotides shown in red, somewhat like a pin tumbler lock (image from Wikipedia).

We keep the indexes for the motifs in a list, and the greedy approach is to try all possible start points for sequence #1 between index 0 and len(seq) - len(motif). We update the alignment list with the index which gives the best score. Since we do not actually know the correct motif start points in sequences 2 and 3, we run the process repeatedly, choosing a new sequence to slide in each iteration of the loop. It seems like a combination of sequential indexes followed by a limited number of random choices works best.

A major limitation of this approach is that it gets trapped by local maxima. I will try to show some visualizations of this soon. In the meantime, here is a spiffed up version of the basic search function.

# the greedy sampler:  consider one seq at a time
# slide to find the best score v. current motif
# repeat R times
# aL holds the pos of the motifs in seqs in L
# L is a list of the seqs
# D holds N (# seqs), SZ (len seq), M (len motif)
def slideForBestAlignment(L,D,aL=None):
N = D['N']
M = D['M']
SZ = D['SZ']
R = D['R']
if not aL: aL = init_aL(D)
for i in range(R):
if i < 4*N: currSeq = i%N
else: currSeq = random.choice(range(N))
mL = getAlignment(L,M,aL)
subL = list()
for j in range(SZ-M):
seq = L[currSeq][j:j+M]
score = scorefunction(mL + [seq])
maxS,maxI = subL[-1]
aL[currSeq] = maxI
return currSeq,aL,maxS