Here is an example that shows how difficult the problem of searching for sites is, at least by the method we've used. I include two scripts in the zipped files: search.EC.py and plot.py. The site we want to search for is hard-coded in the file as crp.
As its name implies, the first script searches the E. coli sequence. You'll need to get the sequence first (and save it in the right place---see the script) before it will run. We slide a window over the length of the sequence, shifting one base at a time, and score whatever is showing in the window under the site's scoring scheme. The scores are multiplied by 10 (to convert them to ints) and then saved to disk (22 MB or so).
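The scoring step can be sketched like this. The matrix values here are made up for illustration (a real Crp weight matrix is about 22 bp wide, not 6), and the function name is hypothetical; the point is just the sliding window, the per-position lookup, and the multiply-by-10 conversion to ints:

```python
# Hypothetical 4 x 6 log-odds scoring matrix: one row per base,
# one column per motif position. Values are invented for illustration.
PWM = {
    "A": [ 1.2, -0.5,  0.8, -1.0,  0.3,  0.9],
    "C": [-0.7,  1.1, -0.4,  0.6, -1.2, -0.3],
    "G": [-0.2, -0.8,  0.5,  0.7,  1.0, -0.6],
    "T": [ 0.4,  0.2, -0.9,  0.1, -0.4,  0.5],
}
WIDTH = len(PWM["A"])

def window_scores(seq):
    """Slide a WIDTH-wide window along seq one base at a time,
    sum the matrix entries for what's showing in the window,
    and return each score multiplied by 10 as an int."""
    scores = []
    for i in range(len(seq) - WIDTH + 1):
        s = sum(PWM[base][j] for j, base in enumerate(seq[i:i + WIDTH]))
        scores.append(int(round(s * 10)))
    return scores
```

On the full genome this loop runs a few million times, which is why the real script saves its results to disk rather than recomputing them.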
In the second phase, the search.EC.py script also randomizes the sequence and re-runs the same search. The plot.py script filters the results for those above a threshold and plots a histogram of them (see the graphic above). The red bars are the results obtained with the authentic sequence; the yellow bars are from the randomized sequence.
The point is that although the extreme high values are clearly higher in the real sequence, for a scoring range like 110-120 (that is, 11.0-12.0 in the original scheme), the ratio of the likelihoods under the two models (real E. coli vs. random) is less than 2. So, upon observing a site with a score of, say, 11.5, its significance isn't clear.
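That likelihood ratio is just the fraction of real-sequence scores landing in the range divided by the fraction of randomized-sequence scores landing there. A sketch, with the score lists entirely made up for illustration:

```python
def likelihood_ratio(real_scores, random_scores, lo, hi):
    """Empirical likelihood ratio for observing a score in [lo, hi):
    fraction of real-sequence scores in the range divided by the
    fraction of randomized-sequence scores in the range."""
    real_frac = sum(lo <= s < hi for s in real_scores) / len(real_scores)
    rand_frac = sum(lo <= s < hi for s in random_scores) / len(random_scores)
    return real_frac / rand_frac

# Invented data: 3 of 10 real scores vs. 2 of 10 random scores in [110, 120)
real = [115, 118, 112] + [50] * 7
rand = [111, 113] + [40] * 8
```

With these made-up counts the ratio comes out to 1.5, which is the situation the histogram illustrates: the real sequence is enriched in that range, but not enough to call any single 11.5-scoring window significant.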
One idea that would improve the significance is to note that the genome is subject to selection. For the Crp system to work properly, there has likely been selection against randomly placed sites, so the random model is not really the appropriate one to test against.
Again, zipped files here.