Python for Bioinformatics: Mutual Information (3)

We're working on a paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). Previous posts here and here.

Now it's time to analyze the data. The self comparisons (a single position in the alignment, analyzed for mutual information against itself) yield info values ranging from 4.1 to 0.01.

As you can see in the screenshot, the residue in column 16 (1-based indexing), which has a very low score, is the conserved (catalytic) histidine.

We'll filter out the self comparisons for the plots. Here is the histogram of information values that I got. It's a bit different from the paper, but not much.

In the next part, we'll try to match up the residues which look like they might interact (that have high mutual information) and see if that makes sense in terms of the protein structures.

I picked a familiar HK/RR pair to do this: EnvZ and OmpR. This screenshot of part of Fig 2 shows a section of each protein.

Searching in the alignment file (the alignments don't have titles that I recognize), I recovered a sequence that I think is probably the right one. It matches the Figure as far as I checked:

>26250004-26250005/1-457
KQLADDRTLLMAGVSHDLRTPLTRI
RLATEMMSAESINKDIEECNAIIEQ
FIDYLRTGMADLNAVLGEVIA--AE
SGYEREIETAL-YVKMHPLSIKRAV
ANMVVNAARY-GNGWIKVSSGTEAW
FQVEDDGPGIAPEQRKHLFQPFVRG
DISGTGLGLAIVQRIVDNHNGMLEL
GTSERGGLSIRAWLPNYKILVVDDD
MRLRALLERYLTEQGFQVRSVANAE
QMDRLLTRESFHLMVLDLMLPGEDG
LSICRRLRSQSPMPIIMVTAKGEEV
DRIVGLEIGADDYIPKPFNPRELLA
RIRAVLRR

I put the newlines in to help me count. The alignment is 308 residues total. From Fig 2,

EnvZ (the HK) sequence starts with:

AGVKQLADDRTLLMAGVSHDL
   KQLADDRTLLMAGVSHDL

The second line above is from the alignment. The sequence doesn't start at residue 1, however. By my calculation, residue 0 in the alignment (the K) is residue 15 in the protein (1-based index). So we'll adjust the indexes we obtain by adding 15 to compare with the actual protein sequence.

OmpR (the RR) sequence starts with

MQENYKILVVDDDMRLRALLER
   NYKILVVDDDMRLRALLER

The second line above is from the alignment, where there's an N at position 0 in this fragment. By my calculation, that N is residue 3 of OmpR in Fig 2 (1-based index), and residue 190 of the alignment, so we subtract 187 for values >= 190.

Going back to the plotting script, we filter the data for pairs in which one comes from the HK and one from the RR, and print the top 20. We do the math as outlined above. As a check, we grab the putative EnvZ/OmpR sequence from the alignment file and print the sequence starting at the position we've identified. Here are the results for the top 20 (in each pair the RR is printed first and the HK second).

[Note: to keep the code simple, I ignored the situation where a column contains '-' a gap in the alignment for some of the sequences. That's what's giving us the two strange results below at position 16 and 20.]

> python plot.py 
 14 RLRAL  42 ATEMM 0.822
 18 LLERY  42 ATEMM 0.816
 22 YLTEQ  42 ATEMM 0.769
 15 LRALL  37 TRIRL 0.730
 15 LRALL  42 ATEMM 0.694
 56 MLPGE  54 DIEEC 0.688
 14 RLRAL  54 DIEEC 0.678
 21 RYLTE  42 ATEMM 0.677
 14 RLRAL  38 RIRLA 0.668
 83 KGEEV  22 TLLMA 0.667
 22 YLTEQ  21 RTLLM 0.663
  4 YKILV  42 ATEMM 0.658
 83 KGEEV  54 DIEEC 0.657
 22 YLTEQ  18 ADDRT 0.650
 22 YLTEQ  54 DIEEC 0.646
 18 LLERY  86 --AES 0.644
 15 LRALL  54 DIEEC 0.643
 18 LLERY  38 RIRLA 0.636
 22 YLTEQ  45 MMSAE 0.633
 22 YLTEQ  86 --AES 0.631

The two residues with highest mutual information are OmpR residue 14 and EnvZ residue 42. It looks pretty good to me.

[ UPDATE: The heatmap looks pretty boring, so I'm going to skip it.
But I plotted the top 20 interactions (graphic at the top of the post), the repetition indicates we're on the right track.. ]

import sys
from utils import load_data
import matplotlib.pyplot as plt

data = load_data('results.2.txt')
data = data.strip().split('\n')
data = [e.split() for e in data]
data = [(int(t[0]), int(t[1]), float(t[2])) for t in data]

def f(t):  return float(t[2])
def part1():
    L = [t for t in data if t[0] == t[1]]
    L = sorted(L,key=f,reverse=True)
    for t in L[:10]:  print str(t[0]+1).rjust(3),round(t[2],3)
    print
    for t in L[-10:]:  print str(t[0]+1).rjust(3),round(t[2],3)
    sys.exit()
    
#part1()


L = [t[2] for t in data if t[0] != t[1]]
X = 1.0
plt.hist(L,bins=X*50)
ax = plt.axes()
ax.set_xlim(0,X)
plt.savefig('example.png')

# t[0] always > t[1]
N = 190
data = [t for t in data if t[0] >= N and t[1] < N]

aln = load_data('cell3925mmc4.fa')
aln = aln.strip().split('>')[1:]
aln = [e for e in aln if e.startswith('26250004-26250005/1-457')]
envZ_ompR = aln[0].split('\n')[1]

for t in sorted(data,key=f,reverse=True)[:20]:
    i = t[0]
    rr = i - 187
    j = t[1]
    hk = j + 15
    print str(rr).rjust(3), envZ_ompR[i:i+5],
    print str(hk).rjust(3), envZ_ompR[j:j+5],
    print '%3.3f' % t[2]

Friday, March 18, 2011

Mutual Information (3)