Python for Bioinformatics: Motif Discovery 8

Wednesday, June 24, 2009

Motif Discovery 8

As promised, I've implemented a Gibbs Sampler motif search program. It's actually quite simple. If we have a discrete distribution of scores over all positions, considering an individual sequence that is sliding along a motif, we weight each position according to its score, and then choose a new position randomly based on the weights. Here's that part of the code:

def gibbsMove(L):
    x = abs(min(L)) + 0.1
    L = [e+x for e in L]
    S = sum(L)
    c = 0
    r = random.random()
    for i,s in enumerate(L):
        c += s
        f = c*1.0/S
        if f > r:  return ((s,i))
    raise ValueError('f < r')

(I am still worried that the scoring function I'm using, as mentioned here, is incorrect and may be screwing up the position distributions).

The take home message from playing with the sampler is that it takes a lot of iterations for the distribution to settle down where it should be. For example, here is output from a run on a toy problem:

       numSeq =       5
 motif length =       4
   seq length =      13
       cycles = 1000000
       # runs =       1
   freq distr =     1.0

    4 ATAC    ATCTATACAGTCA
    3 ATAC    CCAATACACATCC
    8 ATAC    TGCCTCGCATACG
    8 ATAC    AACTTATTATACA
    1 ATAC    AATACATTATCAA

target:
04 03 08 08 01

top positions:   count  score
04 03 08 08 01     629   18.4
04 05 08 08 01     501   15.7
04 09 08 08 01     450   15.7
03 02 07 07 00     414   13.7
05 04 09 09 02     401   15.7
04 03 08 06 01     336   13.0
04 03 08 05 01     336   13.0
04 03 08 08 08     335   14.2
00 03 08 08 01     332   13.0
05 04 09 07 02     318   13.0

It is tough to get separation, though admittedly most of these look like off-by-one solutions. In this problem, the number of positions to check for each sequence is SZ - M + 1 = 10. So the total number of positions is 10⁵. With a million cycles (and discarding the first 20%), we have 8 samples per position on the average. So the sampler does seem to be finding our motif fairly efficiently (629 times here).

But it seems clear that the efficiency has to get a lot better with increasing length of the sequences. With a sequence length of 1000 and 10 sequences, the number of positions is roughly 1000¹⁰ = 10³⁰.

I am going to have to redo parts of the code in C or C++ in order to test larger sequence sets. So far, real problems do not yield to my sampler even with 5 x 10⁶ cycles.