## Wednesday, June 24, 2009

### Motif Discovery 9

Here is a slightly bigger problem, with M (motif width) = 6, SZ (sequence length) = 50, and N (number of sequences) = 10. We do just 100,000 cycles at first. At first glance the results are disappointing, since none of the top position sets matches the target.

    numSeq = 10
    motif length = 6
    seq length = 50
    cycles = 100000
    # runs = 1
    freq distr = 1.0
    # the motif we generated:
    30 ATACAA AACTTATTACCTATATTGGATTATCAAATGATACAAATCACAGCACCCTC
    38 ATACAA TCGCGGTTTCCCCATACTACACGAGGTGTATGACTTTCATACAATTGCTT
    39 ATACAA CTTTTGTGATTAATAACCTTCGGCAAAATATTGACACACATACAAATTCA
    23 ATACAA AGGAATGAACACGGGCGCGTGCTATACAAAGTTAAAAATGAGTGAAGTCA
     6 ATACAA GCCTTGATACAAGAGACTACCAGTCGTGATGGACCTCAAGTCGAAGACAA
     6 ATACAA GACGAAATACAAAAGGCGATATTTCAGAGTCTGTAATACATATCAGCCCG
     8 ATACAA CTGCATAGATACAATCCTAAAACAGTGACTCGCGAGTTGAACTGCATTCA
     9 ATACAA GCGCCTTCTATACAATCGGTCTCCTGAGCAGGATCCGTGGGAACGGTAAT
    20 ATACAA CATCTGACTAGATTTGTGAAATACAAGCCCTGGCCGCCACCACGGCGGGG
    19 ATACAA CCCACCGGCTCGTACCTGAATACAAGCCATCAAAATTCTTCAGGTGACGG
    ----------------------------------------
    # results
    top positions:                 count  score
    44 08 20 18 13 29 13 06 04 36      6   19.7
    23 40 41 00 42 02 08 39 16 30      6   26.5
    26 27 39 23 11 09 04 09 20 19      5   32.4
    30 16 39 06 06 06 13 07 04 19      5   38.0
    14 29 35 29 18 04 39 11 01 28      5   14.2
    03 36 08 31 09 02 08 09 20 29      4   24.0
    41 37 41 07 43 05 07 08 17 11      4   25.6
    30 38 10 23 06 21 09 09 20 19      4   43.3
    20 38 39 23 39 09 15 09 14 19      4   28.3
    14 13 39 23 06 10 14 09 20 19      4   40.2
    ----------------------------------------
    target:
    30 38 39 23 06 06 08 09 20 19
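As a quick sanity check on the listing above, here is a minimal sketch that looks up the planted motif in each generated sequence. It assumes the positions are 0-based offsets and uses plain `str.find`, which returns the first occurrence only; the names `MOTIF`, `SEQS` and `TARGET` are introduced here for illustration and are not from the original program.

```python
# Sequences and target copied from the listing above.
MOTIF = "ATACAA"
SEQS = [
    "AACTTATTACCTATATTGGATTATCAAATGATACAAATCACAGCACCCTC",
    "TCGCGGTTTCCCCATACTACACGAGGTGTATGACTTTCATACAATTGCTT",
    "CTTTTGTGATTAATAACCTTCGGCAAAATATTGACACACATACAAATTCA",
    "AGGAATGAACACGGGCGCGTGCTATACAAAGTTAAAAATGAGTGAAGTCA",
    "GCCTTGATACAAGAGACTACCAGTCGTGATGGACCTCAAGTCGAAGACAA",
    "GACGAAATACAAAAGGCGATATTTCAGAGTCTGTAATACATATCAGCCCG",
    "CTGCATAGATACAATCCTAAAACAGTGACTCGCGAGTTGAACTGCATTCA",
    "GCGCCTTCTATACAATCGGTCTCCTGAGCAGGATCCGTGGGAACGGTAAT",
    "CATCTGACTAGATTTGTGAAATACAAGCCCTGGCCGCCACCACGGCGGGG",
    "CCCACCGGCTCGTACCTGAATACAAGCCATCAAAATTCTTCAGGTGACGG",
]
TARGET = [30, 38, 39, 23, 6, 6, 8, 9, 20, 19]

# First 0-based offset of the motif in each sequence (-1 if absent).
found = [seq.find(MOTIF) for seq in SEQS]

for pos, seq in zip(found, SEQS):
    print(f"{pos:2d} {seq[pos:pos + len(MOTIF)]}")
print("matches target:", found == TARGET)
```

If the offsets were 1-based instead, every entry in `found` would be one less than the corresponding target position.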

This is not so surprising, since each of these position sets was visited only a few times (4 to 6). But if you look closely you can see that, for example, 19 is the target for the last sequence and is also over-represented in the last column, as is 39 in the third. In fact, nearly all of these results are relatives of the target. For 6 out of 10, if we add a round of sliding to the top score (as described here), we can recover the target.
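To make "relatives of the target" concrete, the sketch below tallies how many positions each of the ten top results from the 100,000-cycle run shares with the target, and how often the target position shows up in each column. It also counts matches after shifting a whole row by a constant, which hints at why a sliding step can help. This is only a comparison against the known answer, not the sliding procedure itself, and the names `TOP`, `TARGET` and `matches` are made up for this example.

```python
# Top-10 position sets and target from the 100,000-cycle run above.
TARGET = [30, 38, 39, 23, 6, 6, 8, 9, 20, 19]
TOP = [
    [44, 8, 20, 18, 13, 29, 13, 6, 4, 36],
    [23, 40, 41, 0, 42, 2, 8, 39, 16, 30],
    [26, 27, 39, 23, 11, 9, 4, 9, 20, 19],
    [30, 16, 39, 6, 6, 6, 13, 7, 4, 19],
    [14, 29, 35, 29, 18, 4, 39, 11, 1, 28],
    [3, 36, 8, 31, 9, 2, 8, 9, 20, 29],
    [41, 37, 41, 7, 43, 5, 7, 8, 17, 11],
    [30, 38, 10, 23, 6, 21, 9, 9, 20, 19],
    [20, 38, 39, 23, 39, 9, 15, 9, 14, 19],
    [14, 13, 39, 23, 6, 10, 14, 9, 20, 19],
]

def matches(row, target, shift=0):
    """Count positions equal to the target after adding a constant shift."""
    return sum(p + shift == t for p, t in zip(row, target))

# Exact agreement with the target, per result row and per sequence column.
per_row = [matches(row, TARGET) for row in TOP]
per_col = [sum(row[i] == t for row in TOP) for i, t in enumerate(TARGET)]
print("per row:   ", per_row)
print("per column:", per_col)

# The seventh row shares nothing with the target as listed, but sliding it
# one base to the right lines up several positions.
print("row 7, shift +1:", matches(TOP[6], TARGET, shift=1))
```

The last column's target (19) appears in 5 of the 10 rows and the third column's (39) in 4, which is the over-representation noted above.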

If we increase the number of cycles to $5 \times 10^6$, our target is the clear winner. With this problem, each sequence has SZ - M + 1 = 45 possible start positions, so there are $45^{10} \approx 3.4 \times 10^{16}$ possible position sets. With $4 \times 10^6$ samples, we've clearly improved enormously on random sampling, or more accurately, solved a problem that is nearly impossible by random sampling.
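The size of the search space can be checked with a few lines of arithmetic. This sketch assumes a motif may start at any offset where it fits entirely inside a sequence, and it uses the 5 million cycle count quoted above; the variable names are chosen here for illustration.

```python
M, SZ, N = 6, 50, 10              # motif width, sequence length, number of sequences

starts_per_seq = SZ - M + 1       # valid start offsets in one sequence
total = starts_per_seq ** N       # one start position chosen per sequence

cycles = 5_000_000                # cycle count quoted in the text
fraction = cycles / total         # share of the space that many samples could cover

print(starts_per_seq)             # 45
print(f"{total:.1e}")             # 3.4e+16
print(f"{fraction:.1e}")
```

Even if every cycle visited a distinct position set, only a vanishing fraction of the space would be examined, which is why uniform random sampling is hopeless here.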

    top positions:                 count  score
    30 38 39 23 06 06 08 09 20 19     11   81.8
    01 34 39 04 06 21 41 09 07 19      7   30.2
    29 37 28 44 07 34 07 08 19 18      7   42.5
    34 38 35 23 43 37 04 09 20 19      7   37.3
    31 14 38 35 07 43 09 10 21 28      7   24.5
    33 39 42 26 09 03 05 12 17 30      7   29.3
    30 38 39 08 44 35 08 09 01 19      6   52.3
    28 11 15 31 42 37 19 07 18 17      6   27.1
    09 37 40 33 05 05 18 08 06 18      6   36.0
    36 37 32 29 32 05 07 08 19 18      6   39.5
    ----------------------------------------
    target:
    30 38 39 23 06 06 08 09 20 19