Wednesday, June 24, 2009

Motif Discovery 9

Here is a slightly bigger problem, with M (motif width) = 6, SZ (seq length) = 50, and N (number of sequences) = 10. We do just 100,000 cycles at first. A first glance at the results is disappointing, since none of the top-ranked sets of positions matches the target.
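For readers who want to reproduce a test set of this shape, here is a minimal sketch of planting one exact copy of a motif in each random sequence. The function name `plant_motif`, the uniform background, and the fixed seed are assumptions for illustration, not the code that produced the listing below. Positions are 0-based, as in the listing.

```python
import random

def plant_motif(motif, seq_len, num_seqs, seed=0):
    """Return (start, sequence) pairs, each a random sequence with one motif copy."""
    rng = random.Random(seed)
    planted = []
    for _ in range(num_seqs):
        # Random background, uniform over the four nucleotides (an assumption).
        bases = [rng.choice("ACGT") for _ in range(seq_len)]
        # Any start from 0 to seq_len - len(motif) keeps the motif inside the sequence.
        start = rng.randrange(seq_len - len(motif) + 1)
        bases[start:start + len(motif)] = motif
        planted.append((start, "".join(bases)))
    return planted

data = plant_motif("ATACAA", seq_len=50, num_seqs=10)
for start, seq in data:
    print(start, seq)
```

Note that the random background can contain partial or even full copies of the motif by chance, which is one reason a sampler may settle on positions other than the planted ones.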


numSeq = 10
motif length = 6
seq length = 50
cycles = 100000
# runs = 1
freq distr = 1.0

# the motif we generated:
30 ATACAA AACTTATTACCTATATTGGATTATCAAATGATACAAATCACAGCACCCTC
38 ATACAA TCGCGGTTTCCCCATACTACACGAGGTGTATGACTTTCATACAATTGCTT
39 ATACAA CTTTTGTGATTAATAACCTTCGGCAAAATATTGACACACATACAAATTCA
23 ATACAA AGGAATGAACACGGGCGCGTGCTATACAAAGTTAAAAATGAGTGAAGTCA
6 ATACAA GCCTTGATACAAGAGACTACCAGTCGTGATGGACCTCAAGTCGAAGACAA
6 ATACAA GACGAAATACAAAAGGCGATATTTCAGAGTCTGTAATACATATCAGCCCG
8 ATACAA CTGCATAGATACAATCCTAAAACAGTGACTCGCGAGTTGAACTGCATTCA
9 ATACAA GCGCCTTCTATACAATCGGTCTCCTGAGCAGGATCCGTGGGAACGGTAAT
20 ATACAA CATCTGACTAGATTTGTGAAATACAAGCCCTGGCCGCCACCACGGCGGGG
19 ATACAA CCCACCGGCTCGTACCTGAATACAAGCCATCAAAATTCTTCAGGTGACGG
----------------------------------------
# results

top positions:                  count  score
44 08 20 18 13 29 13 06 04 36       6   19.7
23 40 41 00 42 02 08 39 16 30       6   26.5
26 27 39 23 11 09 04 09 20 19       5   32.4
30 16 39 06 06 06 13 07 04 19       5   38.0
14 29 35 29 18 04 39 11 01 28       5   14.2
03 36 08 31 09 02 08 09 20 29       4   24.0
41 37 41 07 43 05 07 08 17 11       4   25.6
30 38 10 23 06 21 09 09 20 19       4   43.3
20 38 39 23 39 09 15 09 14 19       4   28.3
14 13 39 23 06 10 14 09 20 19       4   40.2
----------------------------------------
target:
30 38 39 23 06 06 08 09 20 19


This is not so surprising, since each set was visited only a few times (4-6). But if you look closely you can see that, for example, 19 is the target for the last sequence and is also over-represented in that column, as is 39 in the third. In fact, nearly all of these results are relatives of the target. For 6 out of 10, if we add a round of sliding to the top score (as described here), we can recover the target.
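One rough way to see how closely these results are related to the target is to count, for each row, how many sequences have the same start position as the target, with and without a small common shift. The sketch below is only a diagnostic written for this post's numbers; the helper `matches` and the shift range of -2 to 2 are assumptions, and it is not the sliding procedure itself.

```python
TARGET = [30, 38, 39, 23, 6, 6, 8, 9, 20, 19]

# Top position sets from the 100,000-cycle run, copied from the listing above.
TOP = [
    [44, 8, 20, 18, 13, 29, 13, 6, 4, 36],
    [23, 40, 41, 0, 42, 2, 8, 39, 16, 30],
    [26, 27, 39, 23, 11, 9, 4, 9, 20, 19],
    [30, 16, 39, 6, 6, 6, 13, 7, 4, 19],
    [14, 29, 35, 29, 18, 4, 39, 11, 1, 28],
    [3, 36, 8, 31, 9, 2, 8, 9, 20, 29],
    [41, 37, 41, 7, 43, 5, 7, 8, 17, 11],
    [30, 38, 10, 23, 6, 21, 9, 9, 20, 19],
    [20, 38, 39, 23, 39, 9, 15, 9, 14, 19],
    [14, 13, 39, 23, 6, 10, 14, 9, 20, 19],
]

def matches(result, target, shift=0):
    """Count sequences whose start, moved by `shift`, equals the target start."""
    return sum(r + shift == t for r, t in zip(result, target))

for row in TOP:
    exact = matches(row, TARGET)
    # Try small common shifts to spot sets that are out of phase with the target.
    best = max(range(-2, 3), key=lambda s: matches(row, TARGET, s))
    print(row, "exact:", exact, "best shift:", best,
          "matches at best shift:", matches(row, TARGET, best))
```

The eighth row, for instance, agrees with the target in 7 of the 10 sequences, while the seventh row shares no positions exactly but lines up in 4 sequences when every start is moved by one.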

If we increase the number of cycles to 5 x 10^6, our target is the clear winner. With this problem, each sequence has 50 - 6 + 1 = 45 possible start positions, so there are 45^10 = 3.4 x 10^16 possible sets of positions. With 4 x 10^6 samples, we've clearly improved enormously on random sampling, or more accurately, solved a problem that is nearly impossible by random sampling.
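The size of the search space can be checked with a few lines of arithmetic. This sketch assumes that each cycle looks at one set of positions, which is a simplification made only to get a feel for the fraction of the space that is sampled.

```python
M, SZ, N = 6, 50, 10            # motif width, sequence length, number of sequences

starts_per_seq = SZ - M + 1     # valid 0-based start positions in one sequence
search_space = starts_per_seq ** N
cycles = 5 * 10**6              # the larger run described above

print(starts_per_seq)                     # 45
print(f"{search_space:.1e}")              # 3.4e+16
print(f"{cycles / search_space:.1e}")     # fraction of the space visited, at most
```

Even at a few million cycles, the sampler touches only a vanishing fraction of the possible sets of positions, so it is the guided sampling and not coverage that finds the target.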

top positions:                  count  score
30 38 39 23 06 06 08 09 20 19      11   81.8
01 34 39 04 06 21 41 09 07 19       7   30.2
29 37 28 44 07 34 07 08 19 18       7   42.5
34 38 35 23 43 37 04 09 20 19       7   37.3
31 14 38 35 07 43 09 10 21 28       7   24.5
33 39 42 26 09 03 05 12 17 30       7   29.3
30 38 39 08 44 35 08 09 01 19       6   52.3
28 11 15 31 42 37 19 07 18 17       6   27.1
09 37 40 33 05 05 18 08 06 18       6   36.0
36 37 32 29 32 05 07 08 19 18       6   39.5
----------------------------------------
target:
30 38 39 23 06 06 08 09 20 19