Python for Bioinformatics: DNA binding sites 7

Sunday, January 2, 2011

DNA binding sites 7

Continuing in the same vein as a number of recent posts (some links here), the Dropbox link below leads to my version of a site analysis program written in C. It's not very user friendly, for example, paths to the input files are hard-coded. A substantial part of the effort required went into ScoreList.c, which implements a function for constructing a list of scores from a list of nucleotide counts for each position, and a second function which reads such a score list in from disk.

The logic of the program was explained in the previous post. I won't spend much time explaining the details here.

Let's just say that if this blog has a motto it is: "learn by doing."

So... get in there and try writing your own version. If you get stuck, you can see how (or if) I handled the problem.

A few other notes: I tested the program by using a list of counts for crp sites obtained as described (here). The output shows the position, score and sequence of crp sites in the E. coli genome. The first few (threshold = 12) are:

$ ./test
   18968  14.33 tattgtgaactatcgcaaagaa
   42067  19.16 ttctgtgattggtatcacattt
   42187  12.44 attggtgatccataaaacaata
   49633  14.48 aagagtgacgtaaatcacactt
   70157  15.02 aagtgtgacgccgtgcaaataa
  141283  18.95 atgtgtgatcgtcatcacaatt
  141663  12.43 tgatgtgaaaatcctcaaagat
  184120  13.67 aattgtgcttattttagcattt
  223495  13.60 ggatgtgaatcacttcacacaa
  243785  12.34 atttctgacgttagtcatattt
  268498  13.40 taatgtgaacatgatcaacgaa
  281362  18.40 atatgtgatccagcttaaattt
  304272  12.45 tgttgttattcactacacgttt
  312539  17.25 ttttttgacatgtatcacaaat
  365617  14.65 atgagtgagctaactcacatta
..

After implementing the algorithm (which handles the sequence one character at a time) I realized that it would be good to have the site sequence available at the time of printing (as shown above). This required remembering the current site's sequence, which I grafted onto the program at the very end. Luckily it didn't change the running time by much. I think now that while the circular linked list seemed like an elegant approach, it caused as many problems as it solved, and I probably wouldn't do it again.

You can tell just from looking at the patterns that we are getting good matches to the crp consensus [T/A]₃TGTGAN₆TCACA[T/A]₃. Also, the last site is the lac 1 (lacZ) site---reversed. So I think it's working OK.

Finally, it runs in less than 2 seconds! That's a huge improvement on the previous time. Zipped files on Dropbox (here).