Monday, August 11, 2008

PAM: Point Accepted Mutation

I want to spend some time talking about substitution matrices and Markov models.These matrices underly most (all?) methods for protein sequence comparison. They also reflect an underlying model of a simple Markov chain which describes how each position in a sequence changes stochastically with time.

The first one that I know of is the PAM matrix, developed by Margaret Dayhoff and colleagues. The reference is the 1978 paper listed on the page above, but it isn't in PubMed. I see there is now a link to a pdf on the web. (It's behind a firewall now, sorry). I got it from my library, but my copy isn't any better than the one above, where some numbers in the first figure aren't legible. In fact, my copy looks like exactly the same crappy scan that is up.

If you are learning Bioinformatics (and I assume you are if you are reading this) I recommend that you take the time to read the paper carefully and reproduce the results. It's well worth the investment.

It's curious that the paper talks about "accepted point mutations" and PAM, but never explicitly makes the link or explains why they rearranged the order of the letters in the acronym.

They started with the catalogued proteins in the "Atlas of Protein Sequence and Structure" in 1978, and collected 34 sets of closely related "superfamilies" grouped into 71 evolutionary trees. Sequences within a tree were < 15% different. So these are quite closely related sequences. The method for making the trees was parsimony, and ancestral sequences were reconstructed. They give an example, shown here:

Because B is present in the second position for both the second and the fourth sequences at the top (present day), it is inferred to be ancestral for the first and third. Each change on the tree was recorded in both directions. If they infer A => C, they record 1 for C => A as well. They indicate that because the trees could not always be completely determined, in such cases the substitution was apportioned among all possibilities according to the amino acid frequencies in the data. Because of this, they had fractional values, which were removed by multiplying the entire matrix by 10. Thus, the final version has 1572 changes, each counted twice, times 10 = 31440. I transcribed their first Figure, guessing at a few of the numbers, and my total is 31480. So I have four extra changes counted in my data set.

The next piece of data is the table of frequencies for the amino acids in their data set. This is called "Normalized the Accepted Point Mutation Data," which is a bit confusing because it is not related to the mutation data, just to the protein sequences the started with. They normalize the observed changes involving each amino acid to the frequency in the sequences (what they call its "exposure"). They call this value relative mutability, and it varies over nearly an order of magnitude. It reflects the entire process of evolution (i.e. rates of mutation and rates of acceptance during evolutionary time). Here is the table I generated from their data. The columns are amino acid, observed changes, amino acid frequency, and the last column is the relative mutability, scaled to alanine = 100%):

In the last part of the first half, they calculate a new matrix based on the data. This is going to be the matrix of transmission frequencies for our Markov model. The matrix is arranged with "original" amino acids as the column heads, and "replacement" amino acids as the rows. (This is the opposite of the way it is normally done today). Each column in the matrix shows the probability that the indicated "original" amino acid will change to a different one, or stay the same. To make the table easy to look at, their formulation has integer values which sum to 10000. So you must divide by 10000 to get the actual frequencies. Their formula is:

The subscript i refers to rows (new amino acid) and j refers to columns (old amino acid). The matrix entry is calculated as a constant (lambda) times the mutability for the old amino acid times the number of changes seen between old and new, divided by all changes seen for the old amino acid.

There is an interesting bit of circular reasoning in construction of the matrix. Both mutability and total changes appear in the formula, but mutability has already been calculated in terms of the sum of all changes for that amino acid. So an equivalent formulation is:

where K is a different constant and f is frequency. I wrote a Python script to calculate the PAM1 matrix from their data. Here is the resulting matrix, loaded into R. It is nearly the same as theirs, but not identical, probably because of transcription errors in the first table.