Sunday, December 19, 2010

DNA binding sites 2

The first few sequences in fis.sites.txt are:

#format Schneider
1 tttgccgattatttacgcaaa
2 agtgactaaaatttacactca
3 gtggtgcgataattactcata
4 attgcatttaaaatgagcgtg
5 attggtcaaagtttggccttt

The first few lines of output from site_utils.py:

ttatgtacaaatagtaagaaatgtctgaga..
[45, 10, 26, 39]
0.567 -1.603 -0.2245 0.3605
ttggtatatatactatacacctatatttga..
[28, 22, 21, 49]
-0.1175 -0.4655 -0.5326 0.6898

For Schneider's approach each site is present in the alignment in both forward and reverse complement orientation. In the output we are looking at each column of the alignment. The first base 't' of the top line of output is from sequence #1, the second base 't' is from the reverse of sequence #1, the third base 'a' is from sequence #2, and so on. The counts are given next, and sum to N = 120 = 60 * 2. The scores are calculated as given previously:

2 + log2(freq) - 0.018

The scores for the first 11 sequences are:

1          12.2
2 11.8
3 9.0
4 6.5
5 12.2
6 8.5
7 8.4
8 4.6
9 12.0
10 5.3
11 10.4

If you look at the graphic from last time (or the paper) you'll see that we match. So I think we're doing things correctly.

The last thing we do when running this basic site_utils.py file as __main__ is to look at some random sequences. For starters, we calculate

avg for 100000 random seqs: -10.85

which is roughly -0.5 for each position in the 21 nucleotide alignment. According to my understanding the purpose of the correction term was to make this be zero, and I'm not sure why it isn't.

The top 10 sites from the random sequences had scores:

11.72
10.73
10.73
10.52
10.48
10.28
10.08
10.08
10.07
9.86

There were a total of 9 sequences with scores > 10, while only 14 of the 60 authentic sites scored that high. So depending on where our cutoff is for an authentic site we'll get a lot of false positives. For this example, with a cutoff of 5, I found 267 of them.