I also had an idea that I thought at first would make the code faster, though probably it doesn't. Then I thought it would make the code clearer, though I realized after actual implementation that it doesn't do that either. What it does do is make our pass through the sequence simpler in logical terms, and it's the basis of my C-program to evaluate sites or motifs in a DNA sequence. We construct a circularly linked list, as illustrated schematically in the graphic. Each item (node) in the list holds the current value for an accumulating score that will become the score for an individual site in the sequence.
Probably it's best to illustrate with an example. Let's say we have a scoring system used to evaluate potential sites. For example, a site with a in position 1 (we'll use 1-based indexing here) contributes score a1, c in position 2 adds c2, etc.
Suppose we're looking for sites of length N = 4, and we have this sequence: acgtg. We construct a list like the following (shown here in rows to make the layout simple). After the initial priming, we end up with this:
I set up a toy example with scores of
0.01 .. 0.04 for a1 .. a4; 0.11 to 0.14 for c1 .. c4, etc.
Output from the priming phase looks like this:
Now we consider the next nucleotide in the sequence: t. We add the appropriate scores for t to each item in the list, then read the score for the item that contains a total of N scores, and finally, zero that item. The score we read is the score for the site that has sequence acgt. Schematically:
Output looks like this:
The next nucleotide is g:
The site acgt has score:
a1 + c2 + g3 + t4 = 0.01 + 0.12 + 0.23 + 0.34 = 0.70
The site cgtg has score:
c1 + g2 + t3 + g4 = 0.11 + 0.22 + 0.33 + 0.24 = 0.90
Zipped files on Dropbox (here). These will change in the next few days as I debug and test the project more...
The C code for the linked list looks like this: