Friday, March 18, 2011

Mutual Information (3)


We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.

Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.


 90 4.102
260 4.025
122 3.993
209 3.976
234 3.944
101 3.939
235 3.938
152 3.935
63 3.91
222 3.887

159 0.191
250 0.138
132 0.133
291 0.086
199 0.067
242 0.052
16 0.012
106 0.012
156 0.012
158 0.012


As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.



We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.



In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.

I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.



Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:


>26250004-26250005/1-457
KQLADDRTLLMAGVSHDLRTPLTRI
RLATEMMSAESINKDIEECNAIIEQ
FIDYLRTGMADLNAVLGEVIA--AE
SGYEREIETAL-YVKMHPLSIKRAV
ANMVVNAARY-GNGWIKVSSGTEAW
FQVEDDGPGIAPEQRKHLFQPFVRG
DISGTGLGLAIVQRIVDNHNGMLEL
GTSERGGLSIRAWLPNYKILVVDDD
MRLRALLERYLTEQGFQVRSVANAE
QMDRLLTRESFHLMVLDLMLPGEDG
LSICRRLRSQSPMPIIMVTAKGEEV
DRIVGLEIGADDYIPKPFNPRELLA
RIRAVLRR


I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,

EnvZ (the HK) sequence starts with:


AGVKQLADDRTLLMAGVSHDL
KQLADDRTLLMAGVSHDL


The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.

OmpR (the RR) sequence starts with


MQENYKILVVDDDMRLRALLER
NYKILVVDDDMRLRALLER


The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.

Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).

[Note: to keep the code simple, I ignored the situation where a column contains '-' a gap in the alignment for some of the sequences. That's what's giving us the two strange results below at position 16 and 20.]


> python plot.py 
14 RLRAL 42 ATEMM 0.822
18 LLERY 42 ATEMM 0.816
22 YLTEQ 42 ATEMM 0.769
15 LRALL 37 TRIRL 0.730
15 LRALL 42 ATEMM 0.694
56 MLPGE 54 DIEEC 0.688
14 RLRAL 54 DIEEC 0.678
21 RYLTE 42 ATEMM 0.677
14 RLRAL 38 RIRLA 0.668
83 KGEEV 22 TLLMA 0.667
22 YLTEQ 21 RTLLM 0.663
4 YKILV 42 ATEMM 0.658
83 KGEEV 54 DIEEC 0.657
22 YLTEQ 18 ADDRT 0.650
22 YLTEQ 54 DIEEC 0.646
18 LLERY 86 --AES 0.644
15 LRALL 54 DIEEC 0.643
18 LLERY 38 RIRLA 0.636
22 YLTEQ 45 MMSAE 0.633
22 YLTEQ 86 --AES 0.631


The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.

[ UPDATE: The heatmap looks pretty boring, so I'm going to skip it.
But I plotted the top 20 interactions (graphic at the top of the post), the repetition indicates we're on the right track.. ]


import sys
from utils import load_data
import matplotlib.pyplot as plt

data = load_data('results.2.txt')
data = data.strip().split('\n')
data = [e.split() for e in data]
data = [(int(t[0]), int(t[1]), float(t[2])) for t in data]

def f(t): return float(t[2])
def part1():
L = [t for t in data if t[0] == t[1]]
L = sorted(L,key=f,reverse=True)
for t in L[:10]: print str(t[0]+1).rjust(3),round(t[2],3)
print
for t in L[-10:]: print str(t[0]+1).rjust(3),round(t[2],3)
sys.exit()

#part1()


L = [t[2] for t in data if t[0] != t[1]]
X = 1.0
plt.hist(L,bins=X*50)
ax = plt.axes()
ax.set_xlim(0,X)
plt.savefig('example.png')

# t[0] always > t[1]
N = 190
data = [t for t in data if t[0] >= N and t[1] < N]

aln = load_data('cell3925mmc4.fa')
aln = aln.strip().split('>')[1:]
aln = [e for e in aln if e.startswith('26250004-26250005/1-457')]
envZ_ompR = aln[0].split('\n')[1]

for t in sorted(data,key=f,reverse=True)[:20]:
i = t[0]
rr = i - 187
j = t[1]
hk = j + 15
print str(rr).rjust(3), envZ_ompR[i:i+5],
print str(hk).rjust(3), envZ_ompR[j:j+5],
print '%3.3f' % t[2]