Python for Bioinformatics: Mutual information

I want to talk about a really nice paper from Michael Laub's lab at MIT (Skerker et al 2009 PMID 18555780). It'll give us an opportunity to exercise our matplotlib skills.

We're going to try to recreate Fig 1, which is visible in the PubMed page, or you can get the original paper from the link to Cell.

Two-component signal transduction systems are ubiquitous in bacteria (wikipedia). The canonical design consists of a membrane-bound sensor (histidine) kinase (HK) and a cytoplasmic response regulator (RR). E. coli contains about 30 such pairs. The members of each pair have substantial specificity. The HK of the ntr system has specificity for its own RR, and likewise in the phoBR system, phoB has specificity for phoR. We may speak of a HK and its cognate RR.

For our purposes the important thing is that each system comprises (in the simplest design) two protein partners with complementary surfaces. These systems (a pair of proteins) are the products of ancient gene duplication events, and have since diverged over time. Amino acids at interacting sites are constrained to co-evolve in each pair.

If this sounds too vague or too complicated, consider an even simpler example: a stem of paired RNA residues in rRNA named H15.

Here is the H15 sequence in 1D (the parentheses indicate residues involved in pairing---see the link above for details):

((((   (((((    ))))) ))))
TGCACAATGGGCGCAAGCCTGATGCA

And here is the inner stem drawn in 2D to show the base-pairing more directly:

TGGGC
GTCCG

The base-pairing of this stem is more important to rRNA function than the identities of the bases. The result is that in some bacteria the identities of the central bases have been switched:

original   co-evolved

TGGGC        TGCGC
GTCCG        GTGCG

Presumably this happened in 2 discrete steps, but I don't know of any examples where the intermediate state has been preserved. Maybe we should look for some, and it's undoubtedly been studied.

To quantify this kind of coevolution, we'll draw on a concept (and mathematical definition) called mutual information. The steps in the calculation will be:

make a multiple sequence alignment

compare column X and column Y

total number of sequences (length of each column) = c

for each residue x in column X calculate px

for each residue y in column Y calculate py

px is the probability of residue x in column X

We'll write the columns horizontally to save space.
Suppose column X and Y are:

X:  AASSASSTTT
Y:  NMWWNTTKKS

For column X we have:

pA = 0.3 (3 A out of a total of 10 residues)
pS = 0.4
pT = 0.3

For column Y:


pK = 0.2
pM = 0.1
pN = 0.2
pS = 0.1
pT = 0.2
pW = 0.2

We pre-calculate these values for each column. When we calculate the information, we'll refer to the probabilities for column Y as q rather than p, to keep them straight from the p's for column X.

Now, we consider each pair of residues, one from column X and one from column Y. This pair is made up of residues in two interacting protein surfaces or rRNA chains, that may have co-evolved.

pxy is the number of sequences with x in column X and y in column Y

divided by c, the total number of pairs:

X:  AASSASSTTT
Y:  NMWWNTTKKS


pAM = 0.1
pAN = 0.2
pST = 0.2
pSW = 0.2
pTK = 0.2
pTS = 0.1

Finally, to compute the mutual information for this pair of columns, we do this calculation for each individual pair of residues and then sum:

pAM * log (pAM / pAqM) = 0.1 * log (0.1 / (0.3 * 0.1)) = 0.0523
I would have used log₂, but Skerker et al used log₁₀, so I matched them.

Here is part of the output of the script below:

NMWWNTTKKS
AASSASSTTT
pKT 0.2 pK 0.2 qT 0.3 temp 3.33 final 0.1
pMA 0.1 pM 0.1 qA 0.3 temp 3.33 final 0.05
pNA 0.2 pN 0.2 qA 0.3 temp 3.33 final 0.1
pST 0.1 pS 0.1 qT 0.3 temp 3.33 final 0.05
pTS 0.2 pT 0.2 qS 0.4 temp 2.50 final 0.08
pWS 0.2 pW 0.2 qS 0.4 temp 2.50 final 0.08
info 0.47

temp is the result of the calculation inside the parentheses, above. Next time we'll apply this method to the data from Skerker.

from math import log
def log2(n):  return log(n)*1.0/log(2)
def log10(n):  return log(n)*1.0/log(10)

# cache character probabilities for each column
def make_prob_dict(cols):
    # input is a list of columns
    # c is the number of sequences in the alignment
    c = len(cols[0])
    pD = list()
    for col in cols:
        char_kinds = list(set(col))
        values = [col.count(k)*1.0/c for k in char_kinds]
        pD.append(dict(zip(char_kinds,values)))
    return pD

def get_info(i,j,cols,pD,v=False):
    col1, col2 = cols[i], cols[j]
    if v:  print col1 + '\n' + col2
    # as before, c is the number of sequences
    c = len(col1)
    assert c == len(col2)
    info = 0
    pairs = [col1[k] + col2[k] for k in range(c)]
    pL = sorted(list(set(pairs)))
    for p in pL: 
        pXY = pairs.count(p)*1.0/c
        pX = pD[i][p[0]]
        pY = pD[j][p[1]]
        inside = (pXY * 1.0) / (pX * pY)
        if v:  print 'p' + p, pXY,
        if v:  print 'p' + p[0], pX,
        if v:  print 'q' + p[1], pY,
        if v:  print 'temp', '%3.2f' % round(inside, 2),
        outside = pXY * log10(inside)
        if v:  print 'final', round(outside,2)
        info += outside
    if v:  print 'info', round(info,2)
    return info
  
if __name__ == '__main__':
    aln = 'AASSASSTTT\nNMWWNTTKKS\nGTSNTYRSTA\nGGGGGGGGGG'
    cols = aln.split('\n')
    pD = make_prob_dict(cols)
    info = dict()
    for i in range(len(cols)):
        for j in range(i):
            info[(i,j)] = get_info(i,j,cols,pD,v=True)

Tuesday, March 15, 2011

Mutual information