We're going to try to recreate Fig. 1, which is visible on the PubMed page; you can also get the original paper from the link to Cell.
Two-component signal transduction systems are ubiquitous in bacteria (wikipedia). The canonical design consists of a membrane-bound sensor (histidine) kinase (HK) and a cytoplasmic response regulator (RR). E. coli contains about 30 such pairs. The members of each pair have substantial specificity for one another. The HK of the ntr system has specificity for its own RR, and likewise in the phoBR system the HK (PhoR) has specificity for its RR (PhoB). We may speak of an HK and its cognate RR.
For our purposes the important thing is that each system comprises (in the simplest design) two protein partners with complementary surfaces. These pairs of proteins are the products of ancient gene duplication events and have since diverged. Amino acids at interacting sites are constrained to co-evolve within each pair.
If this sounds too vague or too complicated, consider an even simpler example: a stem of paired RNA residues in rRNA named H15.
Here is the H15 sequence in 1D (the parentheses indicate residues involved in pairing---see the link above for details):
And here is the inner stem drawn in 2D to show the base-pairing more directly:
The base-pairing of this stem is more important to rRNA function than the identities of the bases. The result is that in some bacteria the identities of the central bases have been switched:
Presumably this happened in two discrete steps, but I don't know of any examples where the intermediate state has been preserved. This has undoubtedly been studied, so it may be worth looking for some.
To quantify this kind of coevolution, we'll draw on a concept (and mathematical definition) called mutual information. The steps in the calculation will be:
We'll write the columns horizontally to save space.
Suppose columns X and Y are:
For column X we have:
pA = 0.3 (3 A out of a total of 10 residues)
pS = 0.4
pT = 0.3
For column Y:
pK = 0.2
pM = 0.1
pN = 0.2
pS = 0.1
pT = 0.2
pW = 0.2
We pre-calculate these values for each column. When we calculate the information, we'll refer to the probabilities for column Y as q rather than p, to keep them straight from the p's for column X.
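As a minimal sketch of this pre-calculation step, the snippet below counts residues in each column and converts the counts to frequencies. The two strings `col_x` and `col_y` are an assumption: they are one arrangement that reproduces the frequencies quoted in this post, not necessarily the original columns, and the helper name `frequencies` is made up for illustration.

```python
from collections import Counter

# Hypothetical columns, chosen only so that they reproduce the
# frequencies quoted in the text (the original ordering is assumed).
col_x = "AAASSSSTTT"
col_y = "MNNTTWWKKS"


def frequencies(column):
    """Return {residue: fraction of the column made up of that residue}."""
    n = len(column)
    return {res: count / n for res, count in Counter(column).items()}


p = frequencies(col_x)  # probabilities for column X
q = frequencies(col_y)  # probabilities for column Y (called q in the text)

for res in sorted(p):
    print("p%s = %.1f" % (res, p[res]))
for res in sorted(q):
    print("q%s = %.1f" % (res, q[res]))
```

Keeping `p` and `q` as separate dictionaries mirrors the p/q naming used here, so a residue such as S or T that appears in both columns cannot be confused.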
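Now we consider each pair of residues, one from column X and one from column Y, taken from the same sequence (the same position along the two columns). Each pair is made up of residues from two interacting protein surfaces, or from the two sides of an rRNA stem, that may have co-evolved. The joint frequencies of the pairs that occur are: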
pAM = 0.1
pAN = 0.2
pST = 0.2
pSW = 0.2
pTK = 0.2
pTS = 0.1
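The pair frequencies can be tallied the same way as the single-column ones. This is a sketch under the same assumption as before: `col_x` and `col_y` are hypothetical columns arranged to match the numbers in this post, and `joint_frequencies` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical columns consistent with the frequencies quoted in the text.
col_x = "AAASSSSTTT"
col_y = "MNNTTWWKKS"


def joint_frequencies(col_x, col_y):
    """Return {(x, y): fraction of rows in which x and y occur together}."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of residues")
    n = len(col_x)
    # zip pairs up the residues row by row, one from each column
    return {pair: count / n for pair, count in Counter(zip(col_x, col_y)).items()}


joint = joint_frequencies(col_x, col_y)
for (x, y) in sorted(joint):
    print("p%s%s = %.1f" % (x, y, joint[(x, y)]))
```

Only pairs that actually occur appear in the dictionary. Pairs that never occur have a joint frequency of zero and contribute nothing to the sum, so they can be skipped.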
Finally, to compute the mutual information for this pair of columns, we do the following calculation for each individual pair of residues and then sum the results. For the pair AM, for example:
pAM * log10(pAM / (pA * qM)) = 0.1 * log10(0.1 / (0.3 * 0.1)) = 0.0523
I would have used log2, but Skerker et al. used log10, so I matched them.
Here is part of the output of the script below:
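temp is the result of the calculation inside the parentheses above, that is, the ratio of the joint frequency to the product of the two single-column frequencies. Next time we'll apply this method to the data from Skerker et al.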
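For readers who want to reproduce the worked example, here is a minimal sketch of the whole calculation. It is not the original script: the columns are hypothetical strings arranged to match the frequencies quoted above, the function name `mutual_information` is invented for illustration, and `temp` is reused only to echo the name mentioned in the text.

```python
import math
from collections import Counter

# Hypothetical columns consistent with the frequencies quoted in the text.
col_x = "AAASSSSTTT"
col_y = "MNNTTWWKKS"


def mutual_information(col_x, col_y):
    """Mutual information (log base 10) between two alignment columns.

    Returns the total and a dict of the per-pair terms.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of residues")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))

    terms = {}
    for (x, y), count in count_xy.items():
        pxy = count / n
        px = count_x[x] / n
        qy = count_y[y] / n
        temp = pxy / (px * qy)  # the ratio inside the log
        terms[(x, y)] = pxy * math.log10(temp)
    return sum(terms.values()), terms


mi, terms = mutual_information(col_x, col_y)
for (x, y) in sorted(terms):
    print("%s%s  %.4f" % (x, y, terms[(x, y)]))
print("MI = %.4f" % mi)
```

Switching `math.log10` to `math.log2` would give the result in bits instead, which rescales every term by the same constant factor.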