Python for Bioinformatics: DNA binding sites 3

Sunday, December 19, 2010

DNA binding sites 3

Continuing with binding site analysis; the first two posts are here and here.

Tom Schneider also invented sequence logos, which display the information in binding sites in an intuitive, graphical way (Schneider 1990 PMID 2172928). In that paper (following Shannon), they define an uncertainty measure for each position in an alignment:

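$$H(l) = -\sum_{b \in \{A,C,G,T\}} f(b,l)\,\log_2 f(b,l)$$

where H(l), the uncertainty at position l, equals minus the sum over the four nucleotides of the frequency f(b,l) of each base b at that position times log₂ of that frequency. Then, the information is:

$$R_{seq}(l) = 2 - \bigl(H(l) + e(n)\bigr)$$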

where e(n) is a small-sample correction factor. Thus, uncertainty plus information is constant, and approximately equal to 2 (bits). It's no coincidence that, lacking any information about which nucleotide is present at some position in a sequence, you need to ask me two yes-no questions to obtain its identity. For example: is it a purine? Yes. Then is it adenine? Yes. Two questions, two bits.
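As a minimal sketch of the calculation described above (not the code in site_score.py), the uncertainty and information for each column of an alignment could be computed as follows. The function names and the toy alignment are hypothetical, and the correction e(n) uses the common approximation (s - 1) / (2 ln(2) n) with s = 4, which is an assumption here; an exact value may be preferable for very small samples.

```python
import math
from collections import Counter

BASES = "ACGT"

def uncertainty(column):
    """H(l) in bits for one alignment column, given as a string of bases."""
    n = len(column)
    counts = Counter(column)
    h = 0.0
    for b in BASES:
        f = counts[b] / n
        if f > 0:  # a base that never occurs contributes nothing
            h -= f * math.log2(f)
    return h

def small_sample_correction(n, s=4):
    """Approximate e(n) for n sequences and s symbols (an assumption here)."""
    return (s - 1) / (2 * math.log(2) * n)

def information(column):
    """Information in bits: 2 - (H(l) + e(n))."""
    n = len(column)
    return 2.0 - (uncertainty(column) + small_sample_correction(n))

# Hypothetical toy alignment, one site per string.
sites = ["ACGT", "ACGA", "ACTT", "AGGC"]
columns = ["".join(col) for col in zip(*sites)]
for pos, col in enumerate(columns, start=1):
    print(pos, round(uncertainty(col), 3), round(information(col), 3))
```

A fully conserved column has zero uncertainty, so its information is 2 bits minus the correction; a column with all four bases equally frequent has an uncertainty of exactly 2 bits.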

The script site_score.py does this calculation for the fis sites example, and we plot our home-grown version of the logo as the graphic below.

You can compare that output to what is in the paper:

The colors are switched for the central position because the values for A and T are exactly equal, and we sorted to plot T on top, while Schneider did the reverse.
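To make the ordering issue concrete, here is a small sketch of how the letters at one logo position might be stacked. In a sequence logo the height of each letter is its frequency times the information at that position. The `stack` function and its tie-breaking rule (alphabetical, so T lands above A when the two are equal) are assumptions for illustration, not necessarily what site_score.py does.

```python
from collections import Counter

def stack(column, info_bits):
    """Return (base, height) pairs for one logo position, bottom to top."""
    n = len(column)
    counts = Counter(column)
    heights = [(b, counts[b] / n * info_bits) for b in "ACGT" if counts[b] > 0]
    # Sort by height, then by letter: with equal heights, T sorts after A
    # and therefore ends up on top of the stack.
    return sorted(heights, key=lambda pair: (pair[1], pair[0]))

# Hypothetical column in which A and T are exactly tied.
result = stack("AAAATTTTGC", info_bits=0.5)
print(result)
```

Reversing the secondary sort key would put A on top instead, which is the only change needed to reproduce the opposite coloring for a tied position.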

There is also a site on the web for making logos. To use that, we need to strip the names out of the sequence file.

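# Read the sites file; each line holds a name followed by a sequence.
with open('fis.sites.txt') as fh:
    data = fh.read()
# Skip the first line, then print only the second field (the sequence).
for line in data.strip().split('\n')[1:]:
    print(line.split()[1])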

The only significant difference is at the middle position (11). For the web site's logo we didn't supply the reversed sequences, so we see mainly A at that position. This is an artifact of the web site's approach, which uses only the sequences as submitted.
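If "reversed sequences" is read as reverse complements (an assumption, though it is consistent with A and T being exactly equal at the central position of an odd-length site), the symmetric data set could be built with a sketch like this. The helper names and the toy sites are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def with_reverse_complements(sites):
    """Return the original sites followed by their reverse complements."""
    return sites + [reverse_complement(s) for s in sites]

# Hypothetical odd-length sites; the real fis sites are longer.
sites = ["AACGA", "TTAGC", "GCATT"]
both = with_reverse_complements(sites)

# At the central position every A in a site pairs with a T in its
# reverse complement (and vice versa), so the two counts match.
mid = len(sites[0]) // 2
center = [s[mid] for s in both]
print(center.count("A"), center.count("T"))
```

Pasting the combined list into the web site should then give a logo that is symmetric about the central position.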

Zipped project files on Dropbox (here).