Python for Bioinformatics: June 2009

Friday, June 26, 2009

Code simplicity

This is hilarious:

If somebody comes up to you and says something like, “How do I make this pony fly to the moon?”, the question you need to ask is, “What problem are you trying to solve?” You’ll find out that they really need to collect gray rocks. Why they thought they had to fly to the moon, and use a pony to do it, only they know. People do get confused like this.

Leopard and gcc

Starting from the very beginning with C++, here is that famous chestnut (filename is test.cpp):


int main () {
    std::cout << "Hello, world!";
    std::cout << std::endl;
    return 0;
}

And immediately I'm scratching my head at weirdness on the command line in Leopard. There is a difference between gcc and g++. Why? It seems the linker is having trouble finding libstdc++. Now I have to remember to use g++.

$ g++ -o test test.cpp
$ ./test
Hello, world!
$ gcc -o test test.cpp
Undefined symbols:
...
...
  "std::cout", referenced from:
      __ZSt4cout$non_lazy_ptr in ccJ7ynTQ.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

Department of "WTF".

Thursday, June 25, 2009

Motif Discovery [-1]

I had hoped to show a working Gibbs Sampler for a real problem (say 10 sequences of length = 1000 each). My tests show that I will need a lot of cycles to make that work. I did write a short C program to do the time-consuming parts, but have so far failed to integrate it into a Cocoa project. Probably most people wouldn't find that useful, anyway, because it's Apple-specific. I will try to brush up my C++ in the future and do the whole thing in that.

Anyway, I'm moving on. I still have some Bayes to do, and after that I'm not sure. I need to read Drew McCormack. I should go back to phylogenetics more seriously. I could tell you more about the course I taught called "Bioinformatics for Biologists," or show the material I developed for it.

By the way, if you have read and appreciated any of my posts, I'd enjoy hearing from you.

I think of that famous question: "if a tree falls in the forest, and no one is there, does it make a sound?" It is said to be a Zen koan, but I don't have a source. Maybe it is apocryphal.

It is OK if no one is listening. I view this blog as the place I store the homeowork for my self-study class in Bioinformatics. I don't like reading other people's code, and I expect you don't like reading mine. But I get a huge learning boost from programming. It has happened many times that I only really understood something after I made it work in my own code.

Besides, even if there is no single person listening in the world right now, the internet lives forever. And the signal-to-noise ratio will improve with time. Google is just the beginning.

Motif Discovery 10

This post will demonstrate the world's simplest Gibbs Sampler. We first construct a 2D distribution by hand. The Python code is:

# a very simple Gibbs sampler
import random,sys
#---------------------------------
# a discrete distribution
R = range(10)
distr = [
     [1,1,2,2,1,1,1,1,1,1],
     [1,3,4,5,2,1,1,1,1,1],
     [1,4,7,6,2,1,1,1,1,1],
     [1,3,4,3,2,1,1,1,1,1],
     [1,1,1,1,1,1,1,1,1,1],
     [1,1,1,1,1,1,1,1,1,1],
     [1,1,1,1,1,2,2,3,2,1],
     [1,1,1,1,1,2,3,4,2,1],
     [1,1,1,1,1,2,2,5,3,1],
     [1,1,1,1,1,1,1,1,2,1]]

def f(x,y):
    return distr[x][y]
#---------------------------------
# save distribution for R plot
FH = open('results.txt','w')
for x in R:
    for y in R:
        # convert to 1-based index
        t = (x+1,y+1,f(x,y))
        L = [str(e) for e in t]
        FH.write('  '.join(L) + '\n')
FH.close()

Two plots of the values (rotate the table above to see the correspondence):

The R code to do the plots:

Rcode
setwd('Desktop')
library(scatterplot3d)
m = read.table('results.txt',as.is=T)
par(las=1)
scatterplot3d(m,
  type='h',pch=16,
  cex.symbol=3,
  cex.axis=2,
  highlight.3d = T,
  xlab='x',ylab='y',
  zlab='z')

s = m[,3]
dim(s) = c(10,10)
s = t(s)
image(s,
  col=topo.colors(10))

The sampler is simplicity itself:

def gibbsMove(L):
    L = [t[0] for t in L]
    S = sum(L)
    cumulative = 0
    # weight by score
    r = random.random()
    for i,s in enumerate(L):
        cumulative += s
        f = cumulative*1.0/S
        if f > r:  return i
    raise ValueError('f < r')

Here are the top scores. The count is the number of times the sampler visited that value; the ratio of the count to the actual score (or value of the distribution at the point), is nearly constant.

c = count, s = score
  c   c/s     ( s,  x,  y)
4208 601.14  ( 7.0 , 2 , 2 )
3789 631.5   ( 6.0 , 2 , 3 )
3239 647.8   ( 5.0 , 1 , 3 )
3103 620.6   ( 5.0 , 8 , 7 )
2636 659.0   ( 4.0 , 1 , 2 )
2551 637.75  ( 4.0 , 7 , 7 )
2465 616.25  ( 4.0 , 2 , 1 )
2466 616.5   ( 4.0 , 3 , 2 )
1929 643.0   ( 3.0 , 8 , 8 )
1894 631.33  ( 3.0 , 6 , 7 )
1901 633.67  ( 3.0 , 1 , 1 )
1893 631.0   ( 3.0 , 3 , 3 )
1842 614.0   ( 3.0 , 3 , 1 )
1971 657.0   ( 3.0 , 7 , 6 )
1315 657.5   ( 2.0 , 8 , 6 )
1374 687.0   ( 2.0 , 1 , 4 )
1268 634.0   ( 2.0 , 7 , 8 )
1394 697.0   ( 2.0 , 6 , 8 )
1363 681.5   ( 2.0 , 9 , 8 )
1227 613.5   ( 2.0 , 0 , 2 )

Here is the rest of the code needed to run the example and show the results:

x = random.choice(R)
y = random.choice(R)
L = [[f(x,y),x,y]]
t = ('x','y')

T = int(1E5)
for i in range(T):
    L2 = list()
    if random.choice(t) == 'x':
        y = L[-1][2]
        for x in R:  L2.append((f(x,y),x,y))
    else:
        x = L[-1][1]
        for y in R:  L2.append((f(x,y),x,y))
        
    j = gibbsMove(L2)
    L.append(L2[j])
#---------------------------------
# show results
D = dict()
for t in L:
    t = tuple(t)
    if D.has_key(t):  D[t] += 1
    else:  D[t] = 1
    
kL = sorted(D.keys(),key=lambda t:  t[0])
kL.reverse()

def show(t):
    counts = D[t]
    score = t[0]
    print str(counts).rjust(4),
    ratio = 1.0*counts/score
    print str(round(ratio,2)).ljust(7),
    print '(' + str(round(score,1)).rjust(4),
    print ',',t[1],',',t[2],')'

print 'c = count, s = score'
print '  c   c/s     ( s,  x,  y)'
for t in kL[:20]:  show(t)
print '-' * 10
for t in kL[-20:]: show(t)

Update: fixed a bug in "show."

Wednesday, June 24, 2009

Motif Discovery 9

Here is a slightly bigger problem, with M (motif width) = 6, SZ (seq length) = 50, and N (number of sequences) = 10. We do just 100,000 cycles at first. A first glance at the results is disappointing, since none of the positions discovered matches the target.

       numSeq =      10
 motif length =       6
   seq length =      50
       cycles =  100000
       # runs =       1
   freq distr =     1.0

# the motif we generated:
   30 ATACAA    AACTTATTACCTATATTGGATTATCAAATGATACAAATCACAGCACCCTC
   38 ATACAA    TCGCGGTTTCCCCATACTACACGAGGTGTATGACTTTCATACAATTGCTT
   39 ATACAA    CTTTTGTGATTAATAACCTTCGGCAAAATATTGACACACATACAAATTCA
   23 ATACAA    AGGAATGAACACGGGCGCGTGCTATACAAAGTTAAAAATGAGTGAAGTCA
    6 ATACAA    GCCTTGATACAAGAGACTACCAGTCGTGATGGACCTCAAGTCGAAGACAA
    6 ATACAA    GACGAAATACAAAAGGCGATATTTCAGAGTCTGTAATACATATCAGCCCG
    8 ATACAA    CTGCATAGATACAATCCTAAAACAGTGACTCGCGAGTTGAACTGCATTCA
    9 ATACAA    GCGCCTTCTATACAATCGGTCTCCTGAGCAGGATCCGTGGGAACGGTAAT
   20 ATACAA    CATCTGACTAGATTTGTGAAATACAAGCCCTGGCCGCCACCACGGCGGGG
   19 ATACAA    CCCACCGGCTCGTACCTGAATACAAGCCATCAAAATTCTTCAGGTGACGG
----------------------------------------
# results

top positions:                  count  score
44 08 20 18 13 29 13 06 04 36       6   19.7
23 40 41 00 42 02 08 39 16 30       6   26.5
26 27 39 23 11 09 04 09 20 19       5   32.4
30 16 39 06 06 06 13 07 04 19       5   38.0
14 29 35 29 18 04 39 11 01 28       5   14.2
03 36 08 31 09 02 08 09 20 29       4   24.0
41 37 41 07 43 05 07 08 17 11       4   25.6
30 38 10 23 06 21 09 09 20 19       4   43.3
20 38 39 23 39 09 15 09 14 19       4   28.3
14 13 39 23 06 10 14 09 20 19       4   40.2
----------------------------------------
target:
30 38 39 23 06 06 08 09 20 19

It is not so surprising since each was visited is only a few times (4-6). But if you look closely you can see that, for example, 19 is found in the target and also over-represented for the last position, and also 39 for the third. In fact, nearly all of these results are relatives of the target. For 6 out of 10, if we add a round of sliding to the top score (as described here), we can recover the target.

If we increase the number of cycles to 5 x 10⁶, our target is the clear winner. With this problem, we have 45¹⁰ = 3.4 x 10¹⁶ positions, so with 4 x 10⁶ samples, we've clearly improved enormously on random sampling, or more accurately, solved a problem that is nearly impossible by random sampling.

top positions:                  count  score
30 38 39 23 06 06 08 09 20 19      11   81.8
01 34 39 04 06 21 41 09 07 19       7   30.2
29 37 28 44 07 34 07 08 19 18       7   42.5
34 38 35 23 43 37 04 09 20 19       7   37.3
31 14 38 35 07 43 09 10 21 28       7   24.5
33 39 42 26 09 03 05 12 17 30       7   29.3
30 38 39 08 44 35 08 09 01 19       6   52.3
28 11 15 31 42 37 19 07 18 17       6   27.1
09 37 40 33 05 05 18 08 06 18       6   36.0
36 37 32 29 32 05 07 08 19 18       6   39.5
----------------------------------------
target:
30 38 39 23 06 06 08 09 20 19

Motif Discovery 8

As promised, I've implemented a Gibbs Sampler motif search program. It's actually quite simple. If we have a discrete distribution of scores over all positions, considering an individual sequence that is sliding along a motif, we weight each position according to its score, and then choose a new position randomly based on the weights. Here's that part of the code:

def gibbsMove(L):
    x = abs(min(L)) + 0.1
    L = [e+x for e in L]
    S = sum(L)
    c = 0
    r = random.random()
    for i,s in enumerate(L):
        c += s
        f = c*1.0/S
        if f > r:  return ((s,i))
    raise ValueError('f < r')

(I am still worried that the scoring function I'm using, as mentioned here, is incorrect and may be screwing up the position distributions).

The take home message from playing with the sampler is that it takes a lot of iterations for the distribution to settle down where it should be. For example, here is output from a run on a toy problem:

       numSeq =       5
 motif length =       4
   seq length =      13
       cycles = 1000000
       # runs =       1
   freq distr =     1.0

    4 ATAC    ATCTATACAGTCA
    3 ATAC    CCAATACACATCC
    8 ATAC    TGCCTCGCATACG
    8 ATAC    AACTTATTATACA
    1 ATAC    AATACATTATCAA

target:
04 03 08 08 01

top positions:   count  score
04 03 08 08 01     629   18.4
04 05 08 08 01     501   15.7
04 09 08 08 01     450   15.7
03 02 07 07 00     414   13.7
05 04 09 09 02     401   15.7
04 03 08 06 01     336   13.0
04 03 08 05 01     336   13.0
04 03 08 08 08     335   14.2
00 03 08 08 01     332   13.0
05 04 09 07 02     318   13.0

It is tough to get separation, though admittedly most of these look like off-by-one solutions. In this problem, the number of positions to check for each sequence is SZ - M + 1 = 10. So the total number of positions is 10⁵. With a million cycles (and discarding the first 20%), we have 8 samples per position on the average. So the sampler does seem to be finding our motif fairly efficiently (629 times here).

But it seems clear that the efficiency has to get a lot better with increasing length of the sequences. With a sequence length of 1000 and 10 sequences, the number of positions is roughly 1000¹⁰ = 10³⁰.

I am going to have to redo parts of the code in C or C++ in order to test larger sequence sets. So far, real problems do not yield to my sampler even with 5 x 10⁶ cycles.

Sunday, June 21, 2009

Motif Discovery 7

I did some exploration of the scoring function's landscape to demonstrate the existence of local maxima. I had thought to work up a graphic display, perhaps along these lines:

But, simplest interesting problem has 3 sequences, and that means that one of the three position variables can't be plotted next to its neighbors. (Maybe I should look into a 3D visualization of density in R). Anyway, I kept it simple. We have sequences of length SZ = 15 and motifs of length M = 3, and the motifs are completely conserved. The motif generator produced:

(1, 'TCA', 'CTCAGTTCGAGAGGA')
(2, 'TCA', 'CTTCAAAGTCAGCTA')
(7, 'TCA', 'CCATACCTCAATGAA')

That is, the motif TCA is embedded at the indicated locations in these sequences. All 13 x 13 x 13 = 2197 alignments were scored (Hamming distance). The two top positions are:

9, 1, 8, 7
9, 1, 2, 7

Here, the score is the first value, and the i,j,k indexes into the sequences follow. The second result is the position designed to have the motif, the first has a fortuitous match at position 8 of sequence # 2.

Next, I evaluated all positions with a score >= a threshhold (6 here), and asked: is it possible to change only one index (i.e. Gibbs sliding) and achieve a higher score? If so, we save the results in a list. If not, we are at a local maximum, and show an empty list [ ]. Here is the printout:


 9, 1, 8, 7 :   []
 9, 1, 2, 7 :   []
 7, 7, 3, 8 :   []
 7, 6, 8, 7 :   [(9, 1, 8, 7)]
 7, 6, 2, 7 :   [(9, 1, 2, 7)]
 7, 5, 1, 6 :   []
 7, 5, 1, 3 :   []
 7, 3, 6, 9 :   []
 7, 2, 9, 8 :   []
 7, 2, 9, 1 :   []
 7, 2, 3, 8 :   []
 7, 1, 8,11 :   [(9, 1, 8, 7)]
 7, 1, 8, 0 :   [(9, 1, 8, 7)]
 7, 1, 2,11 :   [(9, 1, 2, 7)]
 7, 1, 2, 0 :   [(9, 1, 2, 7)]
 7, 0,12, 6 :   []
 7, 0, 7, 6 :   []
 7, 0, 1, 6 :   []
 7, 0, 0, 6 :   []
 6,11, 5,10 :   []
 6, 9, 4, 2 :   []
 6, 7,12, 8 :   [(7, 7, 3, 8)]
 6, 7,12, 0 :   []
 6, 7, 3, 0 :   [(7, 7, 3, 8)]
 6, 5, 7, 6 :   [(7, 5, 1, 6), (7, 0, 7, 6)]
 6, 2, 3, 1 :   [(7, 2, 9, 1), (7, 2, 3, 8)]

I think you can see that of 17 positions with score = 7, only six can be improved by Gibbs sliding. Even at the level of score = 6, three positions can not be improved. And at the level of score = 5, 17 positions can not be improved (not shown). Thus, it is clear there are many local maxima.

Two other things related to what we've been talking about, before I move on to write a Gibbs Sampler. First, I discovered a rather embarassing off-by-one error this morning. It is shown in this graphic:

The correct range to use for embedding or searching for motifs is SZ - M + 1, but my code used SZ - M Luckily, since I used the same expression in both places, the analysis is not in error. It only means that the effective size of the sequences is SZ -1, since the motif was never positioned (or searched for) at the extreme right end of the sequences.

The second issue has to do with the scoring function used by Lawrence et al. Here is my table of the values for nucleotide counts between 0 and 10 (with a total number of sequences in the motif of 10):

 0  0.07 -1.81   0.0
 1  0.14 -0.81 -0.81
 2  0.21 -0.22 -0.44
 3  0.29  0.19  0.58
 4  0.36  0.51  2.06
 5  0.43  0.78  3.89
 6   0.5   1.0   6.0
 7  0.57  1.19  8.35
 8  0.64  1.36  10.9
 9  0.71  1.51 13.63
10  0.79  1.65 16.52

The first column is the count, the second is the q value with pseudocounts, the third is log2(q / p) and the fourth is the score = count x column 3. The problem I see is that the score for a nucleotide which does not appear in the motif against which it is being compared is 0, while the scores for counts of 1 and 2 are less than zero. That doesn't seem very logical.

Saturday, June 20, 2009

Motif Discovery 6

I've been exploring results obtained using the code that I talked about in recent posts with this title, a simple greedy algorithm evaluating possible motifs within a set of sequences. This is a basic exploration of the problem without using the sophistication of the Gibbs Sampler, (which I do still promise to discuss). The parameters controlling output are:

N    the number of sequences given
M    the length of the motif embedded in the sequences
SZ   the length of each sequence
R    the number of cycles of the basic greedy algorithm
T    the number of independent runs
fL   a list of frequencies used in deriving the motif
     e.g. [0.9,0.9,0.8,0.8,0.7]

The scoring method is coded and described here, and comes from Lawrence et al (PMID 8211139).

The bottom line that I've come to in these simulations is that the simple version of the algorithm works pretty well. Of course, the results depend on the values chosen for the parameters above. Decreasing M and increasing SZ decreases the signal to noise ratio. I haven't played with N much, but I think more sequences should be better. Early trials led to setting R = 100---most cycles reached a plateau by 60 or so (these were tested with small values for M and SZ). I've changed the setting for T in relation to M and SZ in order to obtain positive results. It is probable that lower values of T might also show success.

Here is a table of my results. M, SZ and R are introduced above. FD refers to the frequency distribution for the motif, shown here. ratio refers to the fraction of the top 15 hits that are related to the motif. What it means to be related will be discussed later. A final point to note is that these results are "anecdotal" because they were all obtained with a single seed for the random number generator. Some tests were done with other seeds, but not a lot. So you can view this as a proof of principle.

M       SZ       T     FD      ratio
15     2000    1000    A       8 / 15
15     1000    1000    A      15 / 15
12     1000    1000    A      15 / 15
 9     2000    2000    B      15 / 15
 9     1000    1000    B      15 / 15
 9      250    100     B      14 / 15

FD:  A = 0.9 0.9 0.8 0.8 0.7
     B = 0.9 0.9 0.8

Probably the most stringent test is for a motif of 9 nt embedded in sequences of SZ = 2000. Following is the output for that test. The first part is the list of the top 15 results, the index list of the best motif, and its score. The second part shows the motifs in more detail. There is a discrepancy in the scoring, it is due to the fact that in the first section, scores were calculated by excluding the sequence currently in slide mode, whereas in the second part I calculated the score by averaging the results obtained by treating every sequence in the motif as in slide mode. The test runs pretty slowly, but this is Python, after all.

1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1801  584  131 1479 1618  829 1873  681  677 1421     103.14
1802  585  132 1480 1619  830 1785  682  678 1422     101.04
1802  585  132 1480 1619  830 1785  682  678 1422     101.04
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
1803  264  133 1481 1620  831 1786  683  679 1423     100.59
------------------------------
rank = # 1
target      found      
ATGAGCAGG   ATGAGCAGG  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
TTTATCAGT   GTGAGCAGA  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAAT   ATGAGCAAT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
ATGAGCAGT   ATGAGCAGT  
   103.0       107.7    
------------------------------
3012.6 sec

Motif Discovery 5

For the scoring function, in my initial approach I used the Hamming distance because it was simple. Here, I implement the scoring function described in Lawrence (PMID 8211139). This screenshot shows equation 1 from the paper.

Consider a sequence alignment generated from: the list of sequences (L), the list of indexes where the current (tentative version of the) motif starts in each sequence (aL), and the index of the sequence to leave out of the calculation that follows (this is the sequence we are sliding). I call this alignment a motif list (mL) in my code. It is a list of strings.

We consider each position in the current version of the motif in turn. The index i refers to position in the motif, the index j refers to the individual symbols allowed (ACGT). c is the count of nt j at position i. We count the occurence of each nucleotide in the collection of sequences at index i.

Then we add pseudocounts (individually b, in total B).

So, for example, with N = 10 total sequences and 9 in the motif, not counting the one sequence we have left out, if we have 7 C's at a given position and add 1 pseudocount for each nt, we get c = 7 + 1 / 10 - 1 + 4 = 7 / 13 = q. By this method we generate a q value for each nucleotide at each position. We should generate 4 p values corresponding to the background probability for each nt based on the observed sequences, not counting the one being left out. (In my version I simply used 0.25 for each).

Now, given a particular nt observed at each position indexed by i in the candidate alignment with the current motif, we calculate (again, from the paper):

where we are summing over W, the width of the motif, and (sort of) over j. What we do is: at each position retrieve the nt observed in the candidate, count the number of times that nt occurs in the current version of the motif, compute what is basically P(count) x log(P(count)) and sum that over all positions. This is the score to be maximized.

With N-1 nt at each position in the motif, we can pre-calculate the score for each count. Also, the list of scores for each possible nt in the sequence along the range of the motif can also be pre-calculated. This makes the program faster.

I find the second summation symbol a bit confusing. The way I understand it (as mentioned above), at each position in the sequence under consideration there is only one particular nucleotide whose score needs to be retrieved and added to the total score for the alignment. It is only when we consider many different sequences that we have many different nucleotides at each position.

Here is the code:

# compute score by Lawrence et al formula
# for each count from 0..N
# assume background probabilities = 0.25
# (note N-1 correction already made)

def initScoreDict(N):
    scoreD = dict()
    for c in range(N+1):
        q = (c+1) * 1.0 / (N+4)
        d = log2(q*1.0/0.25)
        scoreD[c] = c*d
        #print c,q,d,c*d
    #for k in scoreD:  print k, scoreD[k]
    return scoreD
#------------------------------------
# mL does not have the sequence under consideration
def initMotifScoreDict(mL,sD):
    #print '\n'.join(mL)
    #print '*********'
    N = len(mL)
    M = len(mL[0])
    
# for each position in the alignment
# for each nt at that position we need the score
# LL is a list of [nt at each index j]
    LL = list()
    for j in range(M):
        LL.append([mL[i][j] for i in range(N)])

# for each index j into mL and LL
# for each n in 'ACGT'
# put the score for the given count of nt
    
    msD = dict()
    for j in range(M):
        D2 = dict()
        L = LL[j]
        #print j,''.join(L)
        for n in nt:
            c = L.count(n)
            s = sD[c]
            #print n, c, s
            D2[n] = s
        msD[j] = D2
    return msD

Wednesday, June 17, 2009

Motif Discovery 4

In a series of posts I plan to explore an algorithm published by Lawrence et al for identifying new motifs in DNA sequence, in Science in 1993 (PMID 8211139). It will give me an opportunity to use Python to explore Bioinformatics, as well as learning more about motif discovery and about the Gibbs Sampler, which is a specialized form of Markov Chain Monte Carlo (MCMC). As always, exploration is limited by my simple understanding of the subject, but I post the results on the internet with the hope that someday, someone will find them useful.

Previously, I put up some code for an initial version, and showed some output, and since then I've been exploring its limitations. I hasten to point out that this first version does not yet use the Gibbs Sampler, but it does a decent job of finding motifs when the sequences to be searched are not too long. Another simplification was to use the Hamming distance to compute the score for an alignment. We'll fix that limitation soon.

The code that we have so far is a simple, greedy hill-climbing algorithm. To help visualize the method, suppose that we have three sequences, each with an embedded motif awaiting discovery, where the second and third sequences are fortuitously aligned correctly.

Suppose also that we have previously identified the start points of the motifs in sequences 2 and 3. To identify the motif in the first sequence, we would then slide the it back and forth trying to align the nucleotides shown in red, somewhat like a pin tumbler lock (image from Wikipedia).

We keep the indexes for the motifs in a list, and the greedy approach is to try all possible start points for sequence #1 between index 0 and len(seq) - len(motif). We update the alignment list with the index which gives the best score. Since we do not actually know the correct motif start points in sequences 2 and 3, we run the process repeatedly, choosing a new sequence to slide in each iteration of the loop. It seems like a combination of sequential indexes followed by a limited number of random choices works best.

A major limitation of this approach is that it gets trapped by local maxima. I will try to show some visualizations of this soon. In the meantime, here is a spiffed up version of the basic search function.

# the greedy sampler:  consider one seq at a time
# slide to find the best score v. current motif
# repeat R times
# aL holds the pos of the motifs in seqs in L
# L is a list of the seqs
# D holds N (# seqs), SZ (len seq), M (len motif)
def slideForBestAlignment(L,D,aL=None):
    N = D['N']
    M = D['M']
    SZ = D['SZ']
    R = D['R']
    if not aL:  aL = init_aL(D)
    for i in range(R):
        if i < 4*N:  currSeq = i%N
        else:  currSeq = random.choice(range(N))
        mL = getAlignment(L,M,aL)
        mL.pop(currSeq)
        subL = list()
        for j in range(SZ-M):
            seq = L[currSeq][j:j+M]
            score = scorefunction(mL + [seq])
            subL.append((score,j))
        subL.sort()
        maxS,maxI = subL[-1]        
        aL[currSeq] = maxI
    return currSeq,aL,maxS

Tuesday, June 16, 2009

Motif Discovery 3

I realize that reading other people's code is [fill in the blank with your favorite expletive] impossible, and mind-numbing as well. It's part of the reason I find it difficult to enjoy books like Jones & Pevzner and Kinser's book, Python for Bioinformatics.

But we need to have some code to examine if we want to talk about algorithms. So, if you would like to follow the discussion that's coming later, please paste the code below into a textfile on your desktop (my copy is named gibbs.py but you can do whatever you want). You will also need the code from here, saved as motif.py, and the code from here, saved as scoring.py. You can run it in the usual way by doing "python gibbs.py."

Here is the code, and output for the particular settings. Discussion coming soon.

# Lawrence et al. Gibbs Sampler algorithm
# version 0.1
# the alignment is stored as a list of 
# the motif's start point in each sequence

import random,sys,time
from motif import getTestSet
from scoring import hammingScore
#----------------------------------------------
def showStartSeqs(sL):
    for i,m,seq in sL:
        print str(i).rjust(5),m,'  ',
        print seq[:50],
        if len(seq) > 50:  print '...'
        else:  print
#----------------------------------------------
# aL holds positions of the motifs in seqs in L
# assign elements in aL randomly to start
def init_aL(D):
    N = D['N']
    SZ = D['SZ']
    M = D['M']
    aL = list()
    for i in range(N):
        aL.append(random.choice(range(SZ-M)))
    return aL
#----------------------------------------------
# utility method
def showMotifAlignment(L,D,aL):
    M = D['M']
    retL = list()
    for i in range(len(aL)):
        j = aL[i]
        retL.append(L[i][j:j+M])
    return '\n'.join(retL)
#----------------------------------------------
# given a list of sequences,
# alignment start points
# index of one sequence to leave out
# and motif length

def getMotifSeqs(L,M,aL,currSeq):
    retL = list()
    for i,seq in enumerate(L):
        if i == currSeq:  continue
        j = aL[i]
        retL.append(seq[j:j+M])
    return retL
#----------------------------------------------
# the sampler:  
# consider one seq at a time
# slide to find the best score v. current motif
# repeat R times
def slideSeqs(L,aL,D,v=False):
    N = D['N']
    M = D['M']
    SZ = D['SZ']
    R = D['R']
    currSeq = 0
    for j in range(R):
        if v:  print aL
        sL = getMotifSeqs(L,M,aL,currSeq)
        
        # for currSeq find best match to alignment
        maxI = 0
        maxS = 0
        for i in range(SZ-M):
            seq = L[currSeq][i:i+M]
            score = hammingScore(sL + [seq])
            if score > maxS:
                maxS = score
                maxI = i
        aL[currSeq] = maxI

        if v:  print maxS 
        #for e in sL:  print e
        if v:  print 'currSeq', currSeq, 
        if v:  print 'best', maxI
        if v:  print 'aL:', aL
        #print L[currSeq][maxI:maxI+M]
        
        currSeq = random.choice(range(N))
        if v:  print '-' * 30
    return currSeq,aL,maxS
#----------------------------------------------
def main(D):
    print 'numSeq ='.rjust(15),
    print str(D['N']).rjust(4)
    print 'motif length ='.rjust(15),
    print str(D['M']).rjust(4)
    print 'seq length ='.rjust(15),
    print str(D['SZ']).rjust(4)
    print 'slide cycles ='.rjust(15),
    print str(D['R']).rjust(4)
    print '# runs ='.rjust(15),
    print str(D['T']).rjust(4)
    
    # returns i,m,seq
    # N = D['N']
    startL = getTestSet(D)
    # save the actual sequences separately in L               
    L = [t[2] for t in startL]

    # r for results, L for list
    rL = list()
    # run sliding test from independent starts
    
    print '\nruns:'
    for i in range(T):
        # display something to show progress
        if not i:  print '.',
        if i:
            if not (i+1)%10:  
                print str(i+1).rjust(3)        
            else:  print '.',
        aL = init_aL(D)
        # do the work here
        rL.append(slideSeqs(L,aL,D))
    print '\n\n'

    # extract all slider end scores to a new list
    sL = [t[2] for t in rL]
    # scan for final result with best score
    maxI = -1
    maxS = 0
    for i,score in enumerate(sL):
        if i and not i % 8:  print
        print score,
        if score > maxS:
            maxS = score
            maxI = i 
                
    currSeq,aL,score = rL[maxI]
    print '\nbest score =',
    print score
    n = sL.count(score)
    print 'recovered',n,
    if n > 1:  print 'times\n\n'
    else:  print 'time\n\n'
    
    print 'motif found:'
    print aL
    print showMotifAlignment(L,D,aL)

    # display original motif
    print '\n\noriginal:'
    print hammingScore([t[1] for t in startL])
    showStartSeqs(startL)
    print

if __name__ == "__main__":
    random.seed(157)
    start = time.time()    
    N = 10     # number of sequences
    M = 10     # motif length
    SZ = 50    # sequence length
    R = 25    # cycles of basic sliding
    T = 100     # number of independent starts
    D = {'N':N,'M':M,'SZ':SZ,'R':R,'T':T }
    main(D)
    t = time.time() - start
    print 'elapsed time =', round(t,2), 'sec'

Output:

       numSeq =   10
 motif length =   10
   seq length =   50
 slide cycles =   25
       # runs =  100

runs:
. . . . . . . . .  10
. . . . . . . . .  20
. . . . . . . . .  30
. . . . . . . . .  40
. . . . . . . . .  50
. . . . . . . . .  60
. . . . . . . . .  70
. . . . . . . . .  80
. . . . . . . . .  90
. . . . . . . . . 100



235 227 212 257 254 229 212 220
206 226 228 231 191 208 212 236
198 201 242 211 202 210 171 216
218 209 237 206 201 180 209 206
212 256 239 257 207 207 242 258
202 252 210 192 203 249 199 197
192 211 238 218 221 195 259 222
225 265 250 222 214 214 218 191
215 208 215 249 200 227 211 239
233 207 204 218 196 215 218 196
228 205 221 221 213 207 201 208
239 232 252 245 206 252 214 213
192 213 204 210 
best score = 265
recovered 1 time


motif found:
[35, 17, 24, 32, 18, 5, 33, 24, 18, 36]
GCACATAGCA
GAATTTGAAA
GAAAACACCT
GAACATAAGA
GAAAATAAAA
GATCATAATA
GGACATGCCA
GAAAATTCCA
GAAGTTAACT
GAATATAAGA


original:
252
   35 GCACATAGCA    TCGAAGACAAACTGCCGGCCTGGGTCCCCGAACGAGCACATAGCAATCAT
   17 GAATTTGAAA    TGCGACTGGGTGTTTCCGAATTTGAAATTCTCCACGCTATCGACTCTCAT
   19 GTGCCGAAAA    AATTGGCTGAGTGAATCGAGTGCCGAAAACACCTAGAATTATAGGGTACA
   32 GAACATAAGA    GATGGGCTCAACGTGCCGGGGATGATGGCACGGAACATAAGAGAGCGGAG
   18 GAAAATAAAA    ACGCAACGTACAATCACTGAAAATAAAACAATCCCGCTAGACAACTTCCG
    5 GATCATAATA    TGTATGATCATAATAAACGTACTCGCTCAACTAACGGTTACTGCTTAAAG
   33 GGACATGCCA    TAGTAAGCATAATTGTTACCTGGTGAAATCTGCGGACATGCCAGCGCTGA
   39 GACCTTAAAC    AGCAGGACACTCAGGCGAATATTCGAAAATTCCAAACCGGACCTTAAACT
   18 GAAGTTAACT    ATTTGATGTACTGGTGTAGAAGTTAACTGAAAGACGGTATAGAATGTATC
   36 GAATATAAGA    CTGACCCCATTTGTACTGCCCCCAAGGTACCAGTAGGAATATAAGAGTCC

elapsed time = 17.05 sec

Motif Discovery 2

To help with exploration of the Gibbs sampler, I wrote a small module that generates new motifs, mutates them, and embeds them in random sequence. The critical variables are N (the number of sequences and motifs), M ( the length of the motif), and SZ (the length of the sequence in which the motif is embedded at a random position). The frequencies at which each position varies can be adjusted. This code needs to be on the Python search path under the filename motif.py for use as described in the next post.

# generate a new motif length = SZ
# howmany = N;  average match = f
import random
nt = 'ACGT'

# given a master motif s and 
# a list of frequencies of conservation
# generate a variant
def mutateMotif(m,freqL):
    retL = list()
    for c,f in zip(m,freqL):
        if random.random() > f:
            retL.append(random.choice(nt))
        else:
            retL.append(c)
    return ''.join(retL)

def getMotifList(N,M,SZ):
    L = list()
    # make a new random master motif s
    for i in range(M):
        L.append(random.choice(nt))
    m = ''.join(L)
    
    # positions are conserved at these freqs
    freqL = [0.9,0.8,0.7,0.6,0.5]
    freqL *= M/len(freqL) + 1
    # do mutagenesis
    L = list()
    for i in range(N):
        L.append(mutateMotif(m,freqL))
    return m,L

def embedMotif(m,M,SZ):
    # embed motif in random sequence of len SZ
    L = [random.choice(nt) for i in range(SZ)]
    i = random.choice(range(SZ-M))
    L[i:i+M] = list(m)
    return i,''.join(L)
    
def show(m,seq):
    i = seq.find(m)
    j = i+len(m)
    print i
    print seq[:i] + '\n'
    print seq[i:j] + '\n'
    print seq[j:] + '\n'
        
# we generate a list of motif, seq pairs
# each motif derived from a master motif

def getTestSet(D=None):
    if not D:
        D = {'N':10,'M':10,'SZ':50 }
    N=D['N']    # number of seqs
    M=D['M']    # length of motif
    SZ=D['SZ']  # length of seqs
    
    s,L = getMotifList(N,M,SZ)
    retL = list()
    for m in L:  
        i,seq = embedMotif(m,M,SZ)
        retL.append((i,m,seq))
    return retL

if __name__ == '__main__':
    #random.seed(157)
    L = getTestSet()
    for i,m,seq in L:
        #show(motif,seq)
        print str(i).rjust(3),m,seq
    print
    i,m,seq = L[0]
    show(m,seq)

Sample output:

 21 CGATGCCCCA GCATATAGTTGATTGCGAACCCGATGCCCCAGTCTCCGACGACTTGCCAA
 23 CGCTGCCTTG CACGCTCGTACGATTAGGTAGGGCGCTGCCTTGACTTAAGGAGTAAGCCG
 26 CGGTGCCTCT TGTTAACTAATAGGGAGCAGCGAGTTCGGTGCCTCTCCAGCGTTTCTGTA
 15 CGGCGCCTCG TTGTGCTGAAACTCACGGCGCCTCGCTGGATGCGAAGTGCCTACTCGATA
 37 CGGTTCCTCT TAGTTTCTAATAGCCAGCGTCCTAGTGGCATTCGGGACGGTTCCTCTACG
 36 CGGACCGTCT CTAGCTAAGCACTGCATTGGAAGGACAGCGGAATCGCGGACCGTCTATGC
 22 CGTTGCCTTA CGATAAGGCGACCCGGATGCACCGTTGCCTTAGCAATGCCTTGTCGCCCA
 37 CGCTGCCTCG GCGTCGAATCGCAATCTATAGTTTTTAGTATCATTGACGCTGCCTCGCGT
 15 CGGACTCTCC CTCCCTGGCCGGACACGGACTCTCCCTTAGTCGGCCAAACTGCTAAAGAG
 24 CGGTGCCTCG TTACATTAGGTTGTAGTCATTAGACGGTGCCTCGGTACAGGATTTACCAG

21
GCATATAGTTGATTGCGAACC

CGATGCCCCA

GTCTCCGACGACTTGCCAA

Monday, June 15, 2009

Motif Discovery

I've been looking at Jones & Pevzner's Introduction to Bioinformatics Algorithms again. I have to admit I've been skimming much of the code rather than trying to understand it---I'll have more to say about why I think that is some other time. However, they mention an approach to the problem of motif finding that struck a chord with me. I remember seeing it in a rather famous Science paper by C.E. Lawrence (PMID 8211139) whose title includes the phrase "Gibbs sampling strategy." This paper is special for me partly because when I first looked it up I couldn't understand it very well, but when I re-read it recently in connection with my course in Bioinformatics it seemed pretty easy. I thought I would be fun to explore the topic further here.

The problem, as I'm sure you know, is to align short sequences which are repeated in different places in the genome. For example, they could be binding sites for DNA-binding proteins upstream from promoters. These sites, or motifs, are not completely conserved. They may also contain non-conserved spacers between blocks of conserved nucleotides. In the worst case, the spacers may be heterogeneous in length. Shown here are some invented motifs generated by a Python program I will describe later. The graphic comes from Clustal. As you can see, positions within the motif have variable conservation. These motifs are then embedded in random sequence, and the second image shows an example of this.

In this example, each line represents a different upstream region with the motif embedded in it. There are 10 lines showing 10 regions of the genome, each with a single example of the motif. The goal is to slide the sequences back and forth one at a time in order to line up the red nucleotides of the motif.

(I wish I knew how to make the two sequences have the same resolution. They are the same in the original images but blogger pastes them into a box of equal size for the page, and if I increase the size of the second one to recover the resolution, it looks like crap. Suggestions would be welcome).

Lawrence et al. formalize the search in the following way. Start by choosing a particular set of columns from the current alignment as the prospective motif. (This requires having pre-chosen a value for the width of the motif we're looking for). Now, an individual sequence is chosen at random and moved back and forth. Each position is tested to find the best alignment of this one sequence against the prospective motif. It will happen eventually that two examples of the motif line up with each other and the other sequences fortuitously generate a decent overall score. Once two are aligned, the rest will follow easily.

I will leave Markov Chain Monte Carlo (MCMC) for later. But notice that we can view this as a maximization problem in N-dimensional space (the N sequences). The variables would be the position at which the motif is found in each sequence, and the value of the function is the result of a scoring system. As I understand a Gibbs Sampler,

To quote from this review (pdf),

The Gibbs sampler (introduced in the context of image processing by Geman and Geman 1984), is a special case of Metropolis-Hastings sampling wherein the random value is always accepted (i.e. α = 1). The task remains to specify how to construct a Markov Chain whose values converge to the target distribution. The key to the Gibbs sampler is that one only considers univariate conditional distributions — the distribution when all of the random variables but one are assigned ﬁxed values. Such conditional distributions are far easier to simulate than complex joint distributions...To introduce the Gibbs sampler, consider a bivariate random variable (x, y)...The sampler starts with some initial value y0 for y and obtains x0 by generating a random variable from the conditional distribution p(x | y = y0 ). The sampler then uses x0 to generate a new value of y1 , drawing from the conditional distribution based on the value x0 , p(y | x = x).

That certainly sounds like what we're doing. We change one N-dimensional variable (slide one sequence) at a time. There is more to say here but let me close this post by considering the scoring function. In my initial attempts at this method, I'll be looking at DNA rather than protein, and rather than use a measure of entropy, I decided to use a something called the Hamming distance. According to Wikipedia:

In information theory, the Hamming distance between two strings of equal length is the number of positions for which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one into the other, or the number of errors that transformed one string into the other.

For example, the Hamming distance between these two sequences is equal to the number of mismatches (5).

The trick is that we're considering a set of N motifs. To compute the total score, we must do it pairwise between all combinations. I paste the sequences together and compute the score for the whole thing. Here is the code:

def hammingScore(L,v=False):
    a = list()
    b = list()
    for i in range(len(L)-1):
        for j in range(i+1,len(L)):
            a.append(L[i])
            b.append(L[j])
    a = ''.join(a)
    b = ''.join(b)
    diffs = 0
    if v:  print 'a', a
    if v:  print 'b', b
    for c1,c2 in zip(a,b):
        if not c1 == c2:  diffs += 1
    #print len(a)
    return len(a) - diffs

Sunday, June 14, 2009

Bayes 7: binomial proportion example

In this post, we'll work through a simple binomial proportion problem using a Bayesian approach, following Bolstad, compare it with the classical frequentist method, and then check our results against those obtained with the R functions provided in the Bolstad package.

Suppose we want to estimate the proportion of a population who plan to vote for the incumbent in the next election. We take a random sample of n = 200 individuals from the population and observe:

y = 68 (individuals answering "yes")
n-y = 132 (individuals answering "no")
n = 200

The distribution of y is binomial(n=200,p).

We have no prior information, so we should probably choose a beta(1,1) distribution for the prior. We use the updating rule to generate the posterior probability density for p, it is:

beta(1+y,1+n-y)
beta(69,133)

We calculate the mean and variance of the distribution:

a = 69
b = 133:
m = a / a+b
= 69 / 202 = 0.3416

Var = a*b / ((a+b)**2 * (a+b+1))
= 69*133 / ((202)**2 * (203))
= 9177 / (40804 * 203)
= 0.0011

And the standard deviation is:

s = sqrt(Var)
= 0.0333

Since a and b are large enough (roughly, > 10) we can approximate this beta distribution by a normal distribution with the same mean and variance. Then, we calculate the estimated 95 % credible interval simply as

95 % CI = m -/+ 1.96 * s
= 0.3416 -/+ 1.96 * 0.0333
= 0.3416 -/+ 0.0653
= (0.2763, 0.4069)

Applying these values to a total n = 200, we find:
m = 68.3
s = 6.6
95 % credible interval (CI) = (55.26, 81.38)

The standard frequentist analysis for the same problem would give a maximum likelihood estimate for p as:

p = y / n = 0.34

mean = n*p = 68

Var = n * p * (1-p)
= 200 * 0.34 * 0.66
= 44.88

s = 6.7

We can approximate this binomial distribution as a normal distribution with the same mean and variance. We calculate the 95 % confidence interval as:

95 % confidence interval (CI) = m -/+ 1.96 * s
= 68 -/+ 1.96 * 6.7
= 68 -/+ 13.132
= (54.9, 81.1)

Which is not much different than with the non-informative prior. In fact, according to my understanding, they should be the same, so perhaps I am doing something wrong.

The difference between the two approaches is that with the Bayesian analysis we are justified in saying that the probability is 95 % that the true mean lies within the 95 % credible interval. Actually, we can use the posterior distribution to calculate the probability that it lies in any particular interval.

In contrast, the frequentist perspective is more subtle and less general. Given a pre-determined p value, the CI either contains the mean or it does not, with significance at the level = p.

Here is the R code using the Bolstad package:

library(Bolstad)
binobp(68,200,1,1)

Posterior Mean           :  0.3415842 
Posterior Variance       :  0.0011079 
Posterior Std. Deviation :  0.0332852 

Prob. Quantile 
------ ---------
0.005 0.2591665
0.01 0.2666906
0.025 0.2779134
0.05 0.287724
0.5 0.3410604
0.95 0.3972323
0.975 0.4082264
0.99 0.4210788
0.995 0.4298666

One great thing about this package is that we can easily use other priors. For example we could imagine a discrete, trapizoidal-shaped prior with a flat top and corners at: 0.1,0.2,0.6,0.8.

library(Bolstad)
pi = seq(0,1,length=11)
pi.prior = c(rep(0,2),rep(0.1,5),rep(0,4))
binodp(68,200,pi=pi,pi.prior=pi.prior)

Saturday, June 13, 2009

Bayes 6: normalized beta distribution

One thing I forgot to mention when talking about the beta distribution and its use in Bayesian analysis is the normalizing constant. The beta distribution with shape parameters a and b and variable p is:

p**(a-1) * (1-p)**(b-1)

where the range of the function is from 0 to 1. To be a probability density function, of course, the total density over this range must equal 1.

So, we need to find a constant k which normalizes the beta distribution function. If you think about it for a minute, it should be obvious that k equals the inverse of the integral of the pdf over its range. It turns out that k can be expressed in terms of the gamma function:

The gamma function is a complex pdf in its own right. But a simple fact to remember about it is that if a is a positive integer (as it will be for our beta distributions), then

The fact that k is expressed in terms of the gamma function is important for the derivation of expressions for the mean and variance and updating functions of the beta distribution. But we don't need to think about them to use the beta distribution for analysis of binomial proportions.

Bayes 5: more on beta distribution

There are some useful properties of the beta distribution which simplify the math for Bayesian analysis, as discussed in Bolstad, Introduction to Bayesian Statistics.

The mean and variance of a beta distribution with shape parameters a and b are easily shown to be:

m = a / a+b
Var = a*b / ((a+b)**2 * (a+b+1))

Probably the most important beta distribution is the one which corresponds to a uniform prior, with equal probability over all intervals of the same size between 0 and 1. This distribution has a = 1, b = 1, and has m = 1/2, Var = 1/12.

Other beta distributions weight the probability to one side of the interval, e.g. a = 1, b = 2 gives a linear pdf with density = 2 at x = 0, decreasing to density = 0 at x = 2.

Another helpful property is that the beta distribution can be approximated by a normal distribution of the same mean and variance for sufficiently large a and b. This allows the use of standard tables to calculate the probability density over different intervals. The density can also be calculated exactly.

The likelihood function for a binomial proportion (n trials, y successes, probability of success p) has the form:

p**y * (1-p)**(n-y)

This is a beta distribution. If we have a beta(a,b) distribution for the prior, and then observe y successes in n trials, there is a simple updating rule which generates the posterior probability:

p**(a-1+y) * (1-p)**(b-1+n-y)

This is a beta(a+y,b+n-y) distribution. The case of the uniform prior beta(1,1) gives a posterior

p**(y) * (1-p)**(n-y)

This is a beta(y+1,n+1) distribution.

There is a simple interpretation. Choosing a and b for the prior distribution amounts to assessing the likely number of successes = a and failures = b in a+b trials. After actually observing y successes and n-y failures in n trials, we have an updated beta posterior with an easily computed mean and variance as shown above.

If we choose a uniform prior a = b = 1, then the expression for the prior is:

beta(a,b) = p**(a-1) * (1-p)**(b-1)
= 1

and the posterior is simply equal to the likelihood divided by the marginal probability. So for a uniform prior, the Bayesian approach gives the same answer for the mean as a maximum likelihood approach.

(Note to the reader: I am really at the limits of my understanding in these posts. I'm writing them as much for me as for you. So, if you find an error, please leave a comment. Thanks. And caveat lector.)

Bolstad (and James Curran) have made an R package avaiblable (link). I'll show two examples. Suppose we observe 5 successes in 15 trials. In one case we have a uniform prior and in the other we choose a beta(1,4) prior, which is weighted more toward the low values for p.

library(Bolstad)
#library(help=Bolstad)
par(mfrow=c(1,2))
binobp(5,15,1,1)

Posterior Mean           :  0.3529412 
Posterior Variance       :  0.0126874 
Posterior Std. Deviation :  0.1126385

binobp(5,15,1,4)

Posterior Mean           :  0.3 
Posterior Variance       :  0.01 
Posterior Std. Deviation :  0.1

We can see that the uniform prior gives a slightly fatter distribution, while the prior weighted to low values shifts the mean from 0.35 to 0.3. But with this much data (15 trials) the influence of the prior is really quite small.

Friday, June 12, 2009

Bayes IV: beta distribution

The second book I recommend to help further understanding of Bayesian methods is William Bolstad, Introduction to Bayesian Statistics.

I'm planning a series of posts explaining my low-level understanding of what he covers in the book. But first, I'm going to discuss some probability distributions or pdfs (probability density functions).

We can generalize the Bayesian approach as follows. Typically, we have some observed data (D) and one of a series of hypotheses or models. Often the parameters of those models are not discrete but continuous. The standard form of Bayes rule would be that the posterior ≈ likelihood x prior.

P(H | D) = P(D | H) * P(H) / P(D)

where P(D) is a complex term involving the probability of the observed data under all hypotheses. Often, rather than write H for hypothesis, theta is used as a symbol for the adjustable parameters of a general model. Where previously we summed over some discrete models to get P(D), in real-life situations we would be integrating over the parameters, some or all of which are continuous.

As I understand it, there are two general problems in applying the Bayesian approach. The first is the need to evaluate the likelihood of the model, P(D | H), and multiply it by the prior. The second is the need to integrate to determine the denominator in the equation above. These needs introduce complexities, which are in general solved by tricks.

As explained in detail in Bolstad, it is very helpful to have a function for the prior which, when multiplied by the likelihood, gives a function in the same family as the likelihood function. So, for example, in problems of binomial proportion with parameter p = the probability of success on any one trial, with y successes in n trials, the likelihood function is of the form p**y * (1-p)**(n-y)

In this case, it is enormously helpful if the prior distribution has the same form, which is called the beta distribution.

beta (a,b) = p**(a-1) * (1-p)**(b-1)

where a and b are shape parameters, and p runs between 0 and 1. Here is R code to demonstrate beta distributions with various shapes, and the resulting plots.

Shape parameters: a = 1, b = 0.5,1,..7

x=seq(0,1,length=1001)
par(pch=16)
colors=rainbow(7)
L=c(1,2,3,4,5,6,7)
plot(x,dbeta(x,1,0.5),
  ylim=c(0,5),col='black')
for (j in 1:length(L)){  
  points(x,dbeta(x,1,L[j]),
  col=colors[j])  }

Shape parameters: a = b = 0.5,1,..6

x=seq(0,1,length=1001)
par(pch=16)
colors=rainbow(7)
i=1
L=c(1,2,3,4,5,6)
plot(x,dbeta(x,0.5,0.5),
  ylim=c(0,5),col='black')
for (j in 1:length(L)){  
  points(x,dbeta(x,L[j],L[j]),
  col=colors[j])  }

Shape parameters: b = 2*a = 0.5,1,..6

x=seq(0,1,length=1001)
par(pch=16)
colors=rainbow(7)
i=1
L=c(1,2,3,4,5,6)
plot(x,dbeta(x,0.5,1),
  ylim=c(0,5),col='black')
for (j in 1:length(L)){
  y=L[j]
  points(x,dbeta(x,y,y*2),
  col=colors[j])  }

Bayesian analysis, part III

Let's continue the discussion of Bayes rule as described in Dennis Lindley's book, Understanding Uncertainty. As he explains, there is a simple method for updating probabilities which uses the likelihood odds ratio. Remember Bayes rule as applied to the two urns containing red and white balls, where U1 has 2/3 red balls and U2 has 1/3 red balls:

P(U1 | r) = P(r | U1) P(U1) / P(r)

and

P(r) is the total or marginal probability of r under all models. Here:

P(r) = P(r | U1) P(U1) + P(r | U2) P(U2)

With more than two models, the term P(r) can become hard to calculate.

The odds ratio approach simplifies the analysis by consolidating all alternative models to U1 into a single one. So we can write U for the model (formerly U1) and ~U for the negation of U (i.e. U2 is true if that is the only alternative):

P(U | r) = P(r | U) P(U) / P(r)
P(~U | r) = P(r | ~U) P(~U) / P(r)

P(r) = P(r | U) P(U) + P(r | ~U) P(~U)

We next calculate the ratio of these posterior probabilities for the two cases U and its complement ~U (noticing that the P(r) term cancels):

P(U | r) / P(~U | r) = P(r | U) P(U) / P(r | ~U) / P(~U)

We call this the odds of U | r:

o(U | r) = P(r | U) P(U) / P(r | ~U) / P(~U)

Rearranging:

o(U | r) = [P(r | U) / P(r | ~U)] * [P(U) / P(~U)]

The first term is the ratio of likelihoods, the second term is the prior odds on U1. If we have a uniform prior for the probabilities:

P( U) = 1/2
P(~U) = 1/2

then we consider the initial odds on U:

o(U) = 1

We have these likelihoods:

P(r | U) = 2/3
P(r | U2) = P(r | ~U) = 1/3

We can calculate the likelihood ratio:

P(r | U) / P(r | ~U) = 2/3 / 1/3
= 2

Now, having observed a red ball, we update as follows:

o(U | r) = ( P(r | U) / P(r | ~U) ) * ( P(~U) / P(U) )
= P(r | U) / P(r | ~U) * 1
= 2 * 1
= 2

Upon observing a second red ball, we update again in a simple way:

o(U) = odds ratio o(U) * likelhood ratio
= 2 * 2
= 4

We convert from odds to probabilities (and back) by remembering that:

o = p / 1 - p

So, after observing a single red ball drawn from the urn, we had

o = 2 = p / 1 - p
2 - 2p = p
2 = 3p
p = P(U | r) = 2/3

And after observing a second red ball drawn from the urn,

o = 4 = p / 1 - p
4 - 4p = p
4 = 5p
p = P(U | r) = 4/5

These match the results seen with the straightforward application of Bayes rule, but are obtained more easily.

Notice that if we draw a white ball, the likelihood ratio is

P(U | w) / P(~U | w) = 1/3 / 2/3
= 1/2

So, if we drew a red and a white ball in sequence, the odds would be:

o(U) = 2 * 1/2 * 1 = 1

And in any sequence of draws, all that really matters is the excess draws of one type of ball, not the order in which they appear.

Thursday, June 11, 2009

Bayesian analysis, part II

In Dennis Lindley's book, Understanding Uncertainty, he often uses the example of a jar or urn which contains some number of balls (say, 100). The balls are marked---red or white---with an unknown proportion of each.

Consider two urns, each having some known number of red balls; in the interesting case these are not the same. We do not know whether the particular urn before us is U1 or U2, and we wish to determine the probability that it is U1 or alternatively, U2.

We observe some data, e.g. we draw balls one by one from the urn. (We do this with replacement). We infer the probability that the particular urn is in fact U1, P(U1), and by subtraction calculate P(U2) which equals 1 - P(U1). These values naturally will change as we acquire data.

In a Bayesian analysis, we must first assign a value to P(U1) before we see any data. This value is called the prior. For example, lacking any relevant information, we might assign a uniform prior in which P(U1) = P(U2).

Suppose we know that the proportion of red balls in urn U1 is 66 out of a total of 100, i.e.:

P(r | U1) = 2/3

Suppose also that

P(r | U2) = 1/3

Now, we draw one ball from the urn before us and observe that it is red. What is P(U1) after observing this data? Following Bayes rule or theorem we write:

P(U1 | r) = P(r | U1) P(U1) / P(r)

P(U1) is the prior probability of U1.

As mentioned above, if we don't have any information about which urn is being sampled, a typical Bayesian analysis would assign P(U1) = 1/2. That is, our belief about which urn is chosen before seeing any data is P(U1) = P(U2) = 1/2. Assigning the same probability to all possible models is described as using a "flat" or non-informative prior.

P(r | U1) is the probability of observing this data if the urn is really U1. This is referred to as the likelihood of U1. It is the probability of the observed data, given the model U1.

The denominator of the expression is more complicated than it appears at first. P(r) is the probability of observing a red ball under all possible models.

P(r) = P(r | U1) P(U1) + P(r | U2) P(U2)

Plugging into the main equation:

P(U1 | r) = P(r | U1) P(U1) / P(r | U1) P(U1) + P(r | U2) P(U2)

Let's do the math. Based on the known populations of the two urns, the likelihoods are:

P(r | U1) = 2/3
P(r | U2) = 1/3

The prior is:

P(U1) = P(U2) = 1/2

The marginal probability P(r) is:

P(r) = P(r | U1) P(U1) + P(r | U2) P(U2)
= 2/3 * 1/2 + 1/3 * 1/2
= 1/3 + 1/6
= 1/2

The result we seek is:

P(U1 | r) = P(r | U1) P(U1) / P(r)
= 2/3 * 1/2 / 1/2
= 1/3 / 1/2
= 2/3

and

P(U2 | r) = 1 - P(U1 | r)
= 1 - 2/3
= 1/3

The result P(U1 | r) is called the posterior probability that the urn is U1, having seen the data of a red ball.

Before seeing any data, we assessed the prior probability P(U1) as equal to 1/2. After obtaining a red ball drawn from the urn being tested, we have an increased estimate of P(U1)---the probability that the urn is the one having a higher proportion of red balls---P(U1) = 2/3 and likewise we decrease P(U2) to 1/3.

What drives non-Bayesians crazy about this analysis is the dependence of the estimated probability on the prior. For example, we might have had a prior belief that U2 was twice as likely as U1 and thus might now have P(U1) = P(U2).

Prior beliefs depend on the individual, and in this system they are immune from criticism. However, a striking result seen in doing realistic Bayesian calculations is that observed data quickly overwhelms the influence of the prior. The prior only makes a difference when data are scarce.

Suppose we now draw a second ball from the urn being tested, and find that it is again a red ball. What happens to P(U1)?

First, we will have a new prior, which now depends not only on the probabilities assigned initially, but also the adjustments made as a result of the previously observed data. The new prior = the old posterior.

P(U1) = 2/3; P(U2) = 1/3

The likelihoods are still the same:

P(r | U1) = 2/3
P(r | U2) = 1/3

Finally, the marginal probability has also changed, because we have updated P(U1) and P(U2).

The marginal probability P(r) is now:

P(r) = P(r | U1) P(U1) + P(r | U2) P(U2)
= 2/3 * 2/3 + 1/3 * 1/3
= 4/9 + 1/9
= 5/9

The posterior probability P(U1 | r):

P(U1 | r) = P(r | U1) P(U1) / P(r)
= 2/3 * 2/3 / 5/9
= 4/9 / 5/9
= 4/5

and

P(U2 | r) = 1 - P(U1 | r)
= 1 - 4/5
= 1/5

There are some interesting twists of this story that I would like to pursue in another post. It turns out that we would obtain the same result if we had instead taken the two data observations as one compound event. Another thing is that updating the posterior is done even more simply by using the likelihood ratio P(r | U1) / P(r | U2). And last, if the second ball drawn had been white the posterior probability P(U1) would go back to 1/2.

(I updated this post with a different set of probabilities, which I hope will be a clearer example).

Bayesian analysis

I've been trying to understand more about Bayes and the methodology for inductive reasoning named after him. Since I never paid any attention to statistics until recently, I didn't know anything about this area. (Of course, most statistics classes for biologists wouldn't mention approaches which invoke his name anyway). But, some modern analysis in phylogenetics relies on Bayesian methodology, and there are many other applications in Bioinformatics. Another attraction was the realization that the underpinnings of "classical" statistics (i.e. 20th century "frequentist" statistics) may be a bit shaky. I was certainly surprised to find that the "p" in p-value is usually, though perhaps not always, misleading since it seems to imply "probability," which is not a correct interpretation.

But let's start by considering a thought experiment given by Peter Norvig in one of his outstanding posts:

...suppose I have a test machine that determines whether the subject is a flying leprechaun from Mars. I'm told the test is 99% accurate. I put person after person into the machine and the result is always negative. Finally one day, I put Tom Hanks into the machine and the light comes on that says "flying leprechaun." Would you believe the machine? Of course not: that would be impossible, so we conclude that we just happened to hit the 1% where the test errs. We find it easy to completely reject a test result when it predicts something impossible (even if the test is very accurate); now we have to train ourselves to almost completely reject a test result when it predicts something almost completely impossible (even if the test is very accurate).

Or consider a second example (from Andrew Moore): we wish to determine the relationship between headache and flu (for me). We're considering a universe of possible "events", or an event space, with a total probability of 1. In this universe, it sometimes happens that when I wake up I have a headache. Let us suppose that occurs 10% of the time. Represent this event by the cicle in the event space labeled A in the figure. (I know it's more than 10% of the total, I plead artistic license). It also happens that on occasion I have the flu. The probability of this event is represented by the size of the pink circle. Sometimes I have both flu and headache, labeled A+B and this intersection of the two circles is colored magenta.

We want to know the predictive value of headache for whether I am likely to have the flu, given that it is already known that I have a headache. That is, what is P(flu | headache). The value we seek is the ratio of two areas, the area of the region labeled A+B to the area of the circle labeled A. It is important to notice that just knowing the values for P(A) and P(B) is not enough. We also need to know the relationship between the two circles, how much they overlap.

The other thing to notice is that

P(B | A) = P(A+B) / P(A)

If we know A is true, then the event space of true things with total probability = 1 shrinks to the circle labeled A, and within that restricted event space, the probability of A+B is the ratio of the
of the magenta to the maroon circle. Rearranging:

P(A+B) = P(B | A) P(A).

Since there is nothing special about the labels or the sizes of the circles it is also true that

P(A+B) = P(A | B) P(B).

Now we have two expressions both equal to P(A+B), so they are also equal to each other:

P(A | B) P(B) = P(B | A) P(A)

P(A | B) = P(B | A) P(A) / P(B)

This is the standard form of Bayes rule.

Let's try to get real. Suppose there is a medical condition or disease (D), together with its complement, no disease (~D). There is also a diagnostic test which has characteristic (known) error rates: the rate of false-positives (FP) and the rate of false-negatives (FN).

In order to interpret our test results, these two values are not sufficient. In Bayesian terms, what we want to know is P(D | +), the probability of disease given a positive test result. We want to know the probability of a hypothesis, given some data. (Notice this is the opposite of the way you first learned about probability, where we predict what data we will see, given a hypothesis). Going back to the equality statement above and switching notation, we have

P(D | +) = P(+ | D) P(D) / P(+)

We already know P(+ | D), it is 1 - FN. We haven't talked about either P(D) or P(+) yet. P(D) is the probability or incidence of disease (think back to the flying leprechaun). P(+), the probability of a positive result, looks hard to calculate. But in this case it's easy. Either we have the disease or we do not. In either case, we may have a positive result. The total probability of being positive is the sum of these two events:

P(+) = P(+ | D) P(D) + P(+ | ~D) P(~D)

P(D | +) = P(+ | D) P(D) / P(+ | D) P(D) + P(+ | ~D) P(~D)

Again, we are given P(+ | D) as 1 - false-negative rate.
We are given P(+ | ~D) as false-positive rate.
We know P(D), the incidence of disease.
We calculate P(~D) as 1 - P(D).

Rather than do the math explicitly, I will use another device called a contingency table. Suppose false-negatives = false-positives = 1 %. And suppose we know the incidence of disease is also 1 %. If we consider 10000 individuals, we can calculate as follows:

Look at the sums in the third column. The first two are the marginal probabilities P(+) and P(-) (after dividing by 10000). Notice that P(D | +) = 0.5. The probability of disease given a positive test result is only 50 %. This value depends on P(D) as explained above. In particular, if P(D) is 0.01 %, then nearly all positive test results will be from individuals without disease. Everything depends on the probability of actually encountering a flying leprechaun.

To learn more about Bayes, I could recommend that you start as I did by going to Wikipedia, but I find their "introductory" articles on mathematical topics very hard going. Instead, I suggest that you start with a book by Dennis Lindley entitled Understanding Uncertainty. If you need to be convinced that he is worth the effort, you can thumb through the book on Amazon, or you can check out:

Inference for a Bernoulli Process (A Bayesian View)
Author(s): D. V. Lindley and L. D. Phillips
Source: The American Statistician, Vol. 30, No. 3 (Aug., 1976), pp. 112-119.

If you can access jstor, get it here.

Dynamic Programming

I've started reading Jones & Pevzner, An Introduction to Bioinformatics Algorithms.

These authors spend substantial time on a classic computer science method called "dynamic programming" (invented by Richard Bellman). It is widely used in bioinformatics. For example, sequence alignment algorithms such as Needleman-Wunsch and Smith-Waterman are dynamic programming methods. The key is to have a problem that can be broken into stages or steps; the problem is attacked from one end or the other and the result for each stage is computed in turn, based on current knowledge, and the result is stored.

A problem they use to illustrate the method is the "change" problem, which I first ran across in SICP. (Another is Towers of Hanoi).

As much as I enjoyed the example, I think it is an unfortunate choice for the biologically oriented student of bioinformatics, because it is really more complex than necessary to illustrate the dynamic programming idea; the complexities are not relevant to the sequence problem, and may well distract the naive student.

I would talk first about a simpler problem. Imagine a triangle made up of elements with integer values, having one element on the first line, two on the second, three on the third and so on. Given a current position, we must move down and either left or right to the next element.

Our problem statement is to evaluate alternative paths from top to bottom through this triangle (e.g. to discover the path with the maximum total value), as described in this problem from Project Euler.

As you can appreciate, triangles can become so large that it is impossible to enumerate and evaluate all paths. The number of paths is 2**(n-1) for a triangle with n levels, and this value gets large rapidly with increasing n. To solve the problem, we work from the bottom to the top.

However, what I really want to talk about in this post is the change problem, which is a version of something more generally known as the knapsack problem.

We have a container (the knapsack) with a known capacity, and objects of various size which we wish to store in it. We may desire to know different things, for example, how many different combinations of objects can we find that completely fill the knapsack? This is our problem statement for the change problem: given American coins (1, 5, 10, and 25 cents), how many ways are there to make change for a dollar?

Rather than go back and look up the answer in SICP (it's been a long time, and I was never much good with Scheme), I decided it would be more fun to try to solve it without looking at any other resources. The key for me was to visualize the problem as a tree, in which the top-level node contains a set of 100 pennies. What can we do? Well, we could change 5 pennies to a nickel, or 10 pennies to a dime, and so on. Each one of these new forms of change for a dollar becomes a sub-node below the first one. From each sub-node, in turn, we attempt further transformations, which will become sub-nodes in turn.

For example, starting with only 10 pennies (P=penny,N=nickel,D=dime):

10P 0N 0D => 5P 1N 0D => 0P 2N 0D => 0P 0N 1D
          => 0P ON 1D

Two things to notice: we can do some kind of depth first traversal, or bread-first. (My graph algorithm skills are weak, but that much seems clear at least). And the depth-first would presumably use recursion (for which this problem is famous). The other thing to notice is that we can get duplicates---there are two paths to the 1 dime state.

My solution (Python code) follows below. Simple things: I use a dict to hold each single kind of change, and a utility function that checks whether it is possible for us to change, say, pennies to a nickel for a current dict. This function returns either a new dict containing the result or None. The loop which solves the problem starts with a list containing a single element, a dict like this:

{ 1:100,5:0,10:0,25:0 }

The code does the following:

for each dict that we have discovered but not tried to manipulate:
    save it in our permanent list of solutions
    work with it further...
    for each coin type i from penny to dime:
        for each coin type j from the next highest to the last one:
            try to make change from i to j
            if we succeed:
                check to be sure this is not a solution we know already
                add it to the list for the next round and mark "not done"

Notably, this solution adaptable to other coin types and other total amounts. However, it seems not to be very efficient, since it slows dramatically with amounts beyond a few dollars. I'll have to work on that.

# how many ways to change a dollar?
# each possible combination is a dict
coins = [1,5,10,25]
N = 100
L = [ { 1:N,5:0,10:0,25:0 } ]

# make substitution if possible 
# change n * coin 1 => coin 2
def change(D,c1,c2):
    rD = dict()
    for k in D:  rD[k] = D[k]
    # 10 => 25 is special
    if c1 is 10 and c2 is 25:
        if rD[10] < 2 or rD[5] < 1:  return None
        else:
            rD[5] -= 1;  rD[10] -= 2
            rD[25] += 1        
    else:
        if not c1*rD[c1] >= c2:  return None
        rD[c1] = rD[c1] - c2/c1
        rD[c2] += 1
    return rD
# - - - - - - - - - - - - - - - - - - - - - - - - - -
# start with N pennies
# for each coin 1..10, try changing to higher denom
# save results in L to explore further changes...

# keep a temp copy separate from L so we can append in loop
temp = list()
results = list()

while True:
    #print 'working:', str(len(L)).rjust(5)
    done = True
    for D in L:  # we already checked, D is not yet in results
        results.append(D)
        # for each possible change given coins available
        for i in range(len(coins)-1):
            for j in range(i,len(coins)):
                # change if possible
                rD = change(D,coins[i],coins[j])
                #if we haven't seen this one yet
                if rD and not (rD in results or rD in temp):
                    temp.append(rD)
                    # since there is a new one(s) do not quit
                    done = False
    if done:  break
    L = temp[:]
    temp = list()

print 'finished', str(len(results)).rjust(5)
results.sort(reverse=True)
for D in results:  print D


finished 242
{1: 100, 10: 0, 5: 0, 25: 0}
{1: 95, 10: 0, 5: 1, 25: 0}
{1: 90, 10: 0, 5: 2, 25: 0}
...
{1: 0, 10: 5, 5: 0, 25: 2}
{1: 0, 10: 0, 5: 0, 25: 4}