We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.
Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.
As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.
We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.
In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.
I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.
Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:
I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,
EnvZ (the HK) sequence starts with:
The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.
OmpR (the RR) sequence starts with
The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.
Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).
[Note: to keep the code simple, I ignored the situation where a column contains '-' a gap in the alignment for some of the sequences. That's what's giving us the two strange results below at position 16 and 20.]
The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.
[ UPDATE: The heatmap looks pretty boring, so I'm going to skip it.
But I plotted the top 20 interactions (graphic at the top of the post), the repetition indicates we're on the right track.. ]