Python for Bioinformatics: Motif Discovery 6

Saturday, June 20, 2009

Motif Discovery 6

I've been exploring results obtained using the code that I talked about in recent posts with this title, a simple greedy algorithm evaluating possible motifs within a set of sequences. This is a basic exploration of the problem without using the sophistication of the Gibbs Sampler, (which I do still promise to discuss). The parameters controlling output are:

N    the number of sequences given
M    the length of the motif embedded in the sequences
SZ   the length of each sequence
R    the number of cycles of the basic greedy algorithm
T    the number of independent runs
fL   a list of frequencies used in deriving the motif
     e.g. [0.9,0.9,0.8,0.8,0.7]

The scoring method is coded and described here, and comes from Lawrence et al (PMID 8211139).

The bottom line that I've come to in these simulations is that the simple version of the algorithm works pretty well. Of course, the results depend on the values chosen for the parameters above. Decreasing M and increasing SZ decreases the signal to noise ratio. I haven't played with N much, but I think more sequences should be better. Early trials led to setting R = 100---most cycles reached a plateau by 60 or so (these were tested with small values for M and SZ). I've changed the setting for T in relation to M and SZ in order to obtain positive results. It is probable that lower values of T might also show success.

Here is a table of my results. M, SZ and R are introduced above. FD refers to the frequency distribution for the motif, shown here. ratio refers to the fraction of the top 15 hits that are related to the motif. What it means to be related will be discussed later. A final point to note is that these results are "anecdotal" because they were all obtained with a single seed for the random number generator. Some tests were done with other seeds, but not a lot. So you can view this as a proof of principle.

M       SZ       T     FD      ratio
15     2000    1000    A       8 / 15
15     1000    1000    A      15 / 15
12     1000    1000    A      15 / 15
 9     2000    2000    B      15 / 15
 9     1000    1000    B      15 / 15
 9      250    100     B      14 / 15

FD:  A = 0.9 0.9 0.8 0.8 0.7
     B = 0.9 0.9 0.8

Probably the most stringent test is for a motif of 9 nt embedded in sequences of SZ = 2000. Following is the output for that test. The first part is the list of the top 15 results, the index list of the best motif, and its score. The second part shows the motifs in more detail. There is a discrepancy in the scoring, it is due to the fact that in the first section, scores were calculated by excluding the sequence currently in slide mode, whereas in the second part I calculated the score by averaging the results obtained by treating every sequence in the motif as in slide mode. The test runs pretty slowly, but this is Python, after all.

1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1802  585  132 1480 1619  830 1785  682  678 1422     101.04
1802  585  132 1480 1619  830 1785  682  678 1422     101.04
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
------------------------------
rank = # 1
target      found      
ATGAGCAGG   ATGAGCAGG  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
TTTATCAGT   GTGAGCAGA  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAAT   ATGAGCAAT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
   103.0       107.7    
------------------------------
3012.6 sec