Monday, June 15, 2009

Motif Discovery

I've been looking at Jones & Pevzner's Introduction to Bioinformatics Algorithms again. I have to admit I've been skimming much of the code rather than trying to understand it---I'll have more to say about why I think that is some other time. However, they mention an approach to the problem of motif finding that struck a chord with me. I remember seeing it in a rather famous Science paper by C.E. Lawrence (PMID 8211139) whose title includes the phrase "Gibbs sampling strategy." This paper is special for me partly because when I first looked it up I couldn't understand it very well, but when I re-read it recently in connection with my course in Bioinformatics it seemed pretty easy. I thought I would be fun to explore the topic further here.

The problem, as I'm sure you know, is to align short sequences which are repeated in different places in the genome. For example, they could be binding sites for DNA-binding proteins upstream from promoters. These sites, or motifs, are not completely conserved. They may also contain non-conserved spacers between blocks of conserved nucleotides. In the worst case, the spacers may be heterogeneous in length. Shown here are some invented motifs generated by a Python program I will describe later. The graphic comes from Clustal. As you can see, positions within the motif have variable conservation. These motifs are then embedded in random sequence, and the second image shows an example of this.

In this example, each line represents a different upstream region with the motif embedded in it. There are 10 lines showing 10 regions of the genome, each with a single example of the motif. The goal is to slide the sequences back and forth one at a time in order to line up the red nucleotides of the motif.

(I wish I knew how to make the two sequences have the same resolution. They are the same in the original images but blogger pastes them into a box of equal size for the page, and if I increase the size of the second one to recover the resolution, it looks like crap. Suggestions would be welcome).

Lawrence et al. formalize the search in the following way. Start by choosing a particular set of columns from the current alignment as the prospective motif. (This requires having pre-chosen a value for the width of the motif we're looking for). Now, an individual sequence is chosen at random and moved back and forth. Each position is tested to find the best alignment of this one sequence against the prospective motif. It will happen eventually that two examples of the motif line up with each other and the other sequences fortuitously generate a decent overall score. Once two are aligned, the rest will follow easily.

I will leave Markov Chain Monte Carlo (MCMC) for later. But notice that we can view this as a maximization problem in N-dimensional space (the N sequences). The variables would be the position at which the motif is found in each sequence, and the value of the function is the result of a scoring system. As I understand a Gibbs Sampler,

To quote from this review (pdf),
The Gibbs sampler (introduced in the context of image processing by Geman and Geman 1984), is a special case of Metropolis-Hastings sampling wherein the random value is always accepted (i.e. α = 1). The task remains to specify how to construct a Markov Chain whose values converge to the target distribution. The key to the Gibbs sampler is that one only considers univariate conditional distributions — the distribution when all of the random variables but one are assigned fixed values. Such conditional distributions are far easier to simulate than complex joint distributions...To introduce the Gibbs sampler, consider a bivariate random variable (x, y)...The sampler starts with some initial value y0 for y and obtains x0 by generating a random variable from the conditional distribution p(x | y = y0 ). The sampler then uses x0 to generate a new value of y1 , drawing from the conditional distribution based on the value x0 , p(y | x = x).

That certainly sounds like what we're doing. We change one N-dimensional variable (slide one sequence) at a time. There is more to say here but let me close this post by considering the scoring function. In my initial attempts at this method, I'll be looking at DNA rather than protein, and rather than use a measure of entropy, I decided to use a something called the Hamming distance. According to Wikipedia:
In information theory, the Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one into the other, or the number of errors that transformed one string into the other.
For example, the Hamming distance between these two sequences is equal to the number of mismatches (5).

The trick is that we're considering a set of N motifs. To compute the total score, we must do it pairwise between all combinations. I paste the sequences together and compute the score for the whole thing. Here is the code:

def hammingScore(L,v=False):
a = list()
b = list()
for i in range(len(L)-1):
for j in range(i+1,len(L)):
a = ''.join(a)
b = ''.join(b)
diffs = 0
if v: print 'a', a
if v: print 'b', b
for c1,c2 in zip(a,b):
if not c1 == c2: diffs += 1
#print len(a)
return len(a) - diffs