## Saturday, June 20, 2009

### Motif Discovery 6

I've been exploring results obtained using the code that I talked about in recent posts with this title, a simple greedy algorithm evaluating possible motifs within a set of sequences. This is a basic exploration of the problem without using the sophistication of the Gibbs Sampler, (which I do still promise to discuss). The parameters controlling output are:

 `N the number of sequences givenM the length of the motif embedded in the sequencesSZ the length of each sequenceR the number of cycles of the basic greedy algorithmT the number of independent runsfL a list of frequencies used in deriving the motif e.g. [0.9,0.9,0.8,0.8,0.7]`

The scoring method is coded and described here, and comes from Lawrence et al (PMID 8211139).

The bottom line that I've come to in these simulations is that the simple version of the algorithm works pretty well. Of course, the results depend on the values chosen for the parameters above. Decreasing M and increasing SZ decreases the signal to noise ratio. I haven't played with N much, but I think more sequences should be better. Early trials led to setting R = 100---most cycles reached a plateau by 60 or so (these were tested with small values for M and SZ). I've changed the setting for T in relation to M and SZ in order to obtain positive results. It is probable that lower values of T might also show success.

Here is a table of my results. M, SZ and R are introduced above. FD refers to the frequency distribution for the motif, shown here. ratio refers to the fraction of the top 15 hits that are related to the motif. What it means to be related will be discussed later. A final point to note is that these results are "anecdotal" because they were all obtained with a single seed for the random number generator. Some tests were done with other seeds, but not a lot. So you can view this as a proof of principle.

 `M SZ T FD ratio15 2000 1000 A 8 / 1515 1000 1000 A 15 / 1512 1000 1000 A 15 / 15 9 2000 2000 B 15 / 15 9 1000 1000 B 15 / 15 9 250 100 B 14 / 15FD: A = 0.9 0.9 0.8 0.8 0.7 B = 0.9 0.9 0.8`

Probably the most stringent test is for a motif of 9 nt embedded in sequences of SZ = 2000. Following is the output for that test. The first part is the list of the top 15 results, the index list of the best motif, and its score. The second part shows the motifs in more detail. There is a discrepancy in the scoring, it is due to the fact that in the first section, scores were calculated by excluding the sequence currently in slide mode, whereas in the second part I calculated the score by averaging the results obtained by treating every sequence in the motif as in slide mode. The test runs pretty slowly, but this is Python, after all.

 `1802 585 132 1480 1619 830 1785 682 678 1422 111.941802 585 132 1480 1619 830 1785 682 678 1422 111.941802 585 132 1480 1619 830 1785 682 678 1422 111.941802 585 132 1480 1619 830 1785 682 678 1422 111.941802 585 132 1480 1619 830 1785 682 678 1422 111.941801 584 131 1479 1618 829 1873 681 677 1421 103.141801 584 131 1479 1618 829 1873 681 677 1421 103.141801 584 131 1479 1618 829 1873 681 677 1421 103.141802 585 132 1480 1619 830 1785 682 678 1422 101.041802 585 132 1480 1619 830 1785 682 678 1422 101.041803 264 133 1481 1620 831 1786 683 679 1423 100.591803 264 133 1481 1620 831 1786 683 679 1423 100.591803 264 133 1481 1620 831 1786 683 679 1423 100.591803 264 133 1481 1620 831 1786 683 679 1423 100.591803 264 133 1481 1620 831 1786 683 679 1423 100.59------------------------------rank = # 1target found ATGAGCAGG ATGAGCAGG ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAGT TTTATCAGT GTGAGCAGA ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAAT ATGAGCAAT ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAGT ATGAGCAGT 103.0 107.7 ------------------------------3012.6 sec`