But, simplest interesting problem has 3 sequences, and that means that one of the three position variables can't be plotted next to its neighbors. (Maybe I should look into a 3D visualization of density in R). Anyway, I kept it simple. We have sequences of length SZ = 15 and motifs of length M = 3, and the motifs are completely conserved. The motif generator produced:

That is, the motif TCA is embedded at the indicated locations in these sequences. All 13 x 13 x 13 = 2197 alignments were scored (Hamming distance). The two top positions are:

Here, the score is the first value, and the i,j,k indexes into the sequences follow. The second result is the position designed to have the motif, the first has a fortuitous match at position 8 of sequence # 2.

Next, I evaluated all positions with a score >= a threshhold (6 here), and asked: is it possible to change only one index (i.e. Gibbs sliding) and achieve a higher score? If so, we save the results in a list. If not, we are at a local maximum, and show an empty list [ ]. Here is the printout:

I think you can see that of 17 positions with score = 7, only six can be improved by Gibbs sliding. Even at the level of score = 6, three positions can not be improved. And at the level of score = 5, 17 positions can not be improved (not shown). Thus, it is clear there are many local maxima.

Two other things related to what we've been talking about, before I move on to write a Gibbs Sampler. First, I discovered a rather embarassing off-by-one error this morning. It is shown in this graphic:

The correct range to use for embedding or searching for motifs is SZ - M + 1, but my code used SZ - M Luckily, since I used the same expression in both places, the analysis is not in error. It only means that the effective size of the sequences is SZ -1, since the motif was never positioned (or searched for) at the extreme right end of the sequences.

The second issue has to do with the scoring function used by Lawrence

*et al*. Here is my table of the values for nucleotide counts between 0 and 10 (with a total number of sequences

*in the motif*of 10):

The first column is the count, the second is the q value with pseudocounts, the third is log2(q / p) and the fourth is the score = count x column 3. The problem I see is that the score for a nucleotide which does not appear in the motif against which it is being compared is 0, while the scores for counts of 1 and 2 are less than zero. That doesn't seem very logical.

## No comments:

Post a Comment