## Wednesday, June 24, 2009

### Motif Discovery 8

As promised, I've implemented a Gibbs Sampler motif search program. It's actually quite simple. If we have a discrete distribution of scores over all positions, considering an individual sequence that is sliding along a motif, we weight each position according to its score, and then choose a new position randomly based on the weights. Here's that part of the code:

 `def gibbsMove(L): x = abs(min(L)) + 0.1 L = [e+x for e in L] S = sum(L) c = 0 r = random.random() for i,s in enumerate(L): c += s f = c*1.0/S if f > r: return ((s,i)) raise ValueError('f < r')`

(I am still worried that the scoring function I'm using, as mentioned here, is incorrect and may be screwing up the position distributions).

The take home message from playing with the sampler is that it takes a lot of iterations for the distribution to settle down where it should be. For example, here is output from a run on a toy problem:

 ` numSeq = 5 motif length = 4 seq length = 13 cycles = 1000000 # runs = 1 freq distr = 1.0 4 ATAC ATCTATACAGTCA 3 ATAC CCAATACACATCC 8 ATAC TGCCTCGCATACG 8 ATAC AACTTATTATACA 1 ATAC AATACATTATCAAtarget:04 03 08 08 01top positions: count score04 03 08 08 01 629 18.404 05 08 08 01 501 15.704 09 08 08 01 450 15.703 02 07 07 00 414 13.705 04 09 09 02 401 15.704 03 08 06 01 336 13.004 03 08 05 01 336 13.004 03 08 08 08 335 14.200 03 08 08 01 332 13.005 04 09 07 02 318 13.0`

It is tough to get separation, though admittedly most of these look like off-by-one solutions. In this problem, the number of positions to check for each sequence is SZ - M + 1 = 10. So the total number of positions is 105. With a million cycles (and discarding the first 20%), we have 8 samples per position on the average. So the sampler does seem to be finding our motif fairly efficiently (629 times here).

But it seems clear that the efficiency has to get a lot better with increasing length of the sequences. With a sequence length of 1000 and 10 sequences, the number of positions is roughly 100010 = 1030.

I am going to have to redo parts of the code in C or C++ in order to test larger sequence sets. So far, real problems do not yield to my sampler even with 5 x 106 cycles.