Sunday, January 2, 2011

DNA binding sites 7

Continuing in the same vein as a number of recent posts (some links here), the Dropbox link below leads to my version of a site analysis program written in C. It's not very user friendly, for example, paths to the input files are hard-coded. A substantial part of the effort required went into ScoreList.c, which implements a function for constructing a list of scores from a list of nucleotide counts for each position, and a second function which reads such a score list in from disk.

The logic of the program was explained in the previous post. I won't spend much time explaining the details here.

Let's just say that if this blog has a motto it is: "learn by doing."

So... get in there and try writing your own version. If you get stuck, you can see how (or if) I handled the problem.

A few other notes: I tested the program by using a list of counts for crp sites obtained as described (here). The output shows the position, score and sequence of crp sites in the E. coli genome. The first few (threshold = 12) are:


$ ./test
18968 14.33 tattgtgaactatcgcaaagaa
42067 19.16 ttctgtgattggtatcacattt
42187 12.44 attggtgatccataaaacaata
49633 14.48 aagagtgacgtaaatcacactt
70157 15.02 aagtgtgacgccgtgcaaataa
141283 18.95 atgtgtgatcgtcatcacaatt
141663 12.43 tgatgtgaaaatcctcaaagat
184120 13.67 aattgtgcttattttagcattt
223495 13.60 ggatgtgaatcacttcacacaa
243785 12.34 atttctgacgttagtcatattt
268498 13.40 taatgtgaacatgatcaacgaa
281362 18.40 atatgtgatccagcttaaattt
304272 12.45 tgttgttattcactacacgttt
312539 17.25 ttttttgacatgtatcacaaat
365617 14.65 atgagtgagctaactcacatta
..

After implementing the algorithm (which handles the sequence one character at a time) I realized that it would be good to have the site sequence available at the time of printing (as shown above). This required remembering the current site's sequence, which I grafted onto the program at the very end. Luckily it didn't change the running time by much. I think now that while the circular linked list seemed like an elegant approach, it caused as many problems as it solved, and I probably wouldn't do it again.

You can tell just from looking at the patterns that we are getting good matches to the crp consensus [T/A]3TGTGAN6TCACA[T/A]3. Also, the last site is the lac 1 (lacZ) site---reversed. So I think it's working OK.

Finally, it runs in less than 2 seconds! That's a huge improvement on the previous time. Zipped files on Dropbox (here).