Python for Bioinformatics: Tracking evolution of coding sequences

Friday, December 10, 2010

Tracking evolution of coding sequences

I have two posts exploring calculation of the ratio of synonymous to non-synonymous changes when comparing coding sequences. It's more complicated than I thought. This part sets up the utilities we'll need to process the sequences next time. Our goal is to be able to calculate data for a plot like the graphic above, which is from Page & Holmes.

The GeneticCode class constructs the "universal" genetic code in a few lines using list comprehensions and a hard-coded string of one-letter abbreviations for the amino acids.

aa = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR' +\
     'IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

Objects of this class also contain a reverse dict mapping amino acids to codons, and a third dictionary with codons as keys and a list of synonymous codons as the value. It uses a special cmp function to sort codons in order 'TCAG', i.e. TTT to GGG. I added a function hamming which calculates the Hamming distance between two codons, and a function to construct a set of utility dictionaries. These contain all the codons at a specified hamming distance from each codon. Like this:

nD1
CTT ['TTT', 'CTC', 'CTA', 'CTG', 'CCT', 'CAT'] ..
TAG ['TTG', 'TCG', 'TAT', 'TAC', 'TAA', 'TGG'] ..
..

nD2
CTT ['TTC', 'TTA', 'TTG', 'TCT', 'TAT', 'TGT'] ..
TAG ['TTT', 'TTC', 'TTA', 'TCT', 'TCC', 'TCA'] ..
..

nD3
CTT ['TCC', 'TCA', 'TCG', 'TAC', 'TAA', 'TAG'] ..
TAG ['CTT', 'CTC', 'CTA', 'CCT', 'CCC', 'CCA'] ..
..

The calculation of synonymous and non-synonymous subtitutions proceeds in two parts (Yang). The code to do this is in the module SN_utils.py. The first part is to assess, for each codon one at a time, what fraction of the 9 codons that have a hamming distance of 1 to it, are either synonymous or not. Nonsense codons are discarded before doing the calculation.

For example, CTT has 9 neighbors, none of them stop codons, and 3 of them synonymous. We calculate (3/9) * 3 = 1.0 and record a tuple containing that value as well as 3 minus the value => (1.0, 2.0).

On the other hand, two of TAT's neighbors are stop codons, and one neighbor is synonymous, so we record a value of (1/7) * 3 = 0.429 and 3 minus that => (0.429, 2.571). The values are saved in a dict keyed by each codon.

When we come to evaluate sequences we'll use this info as follows: for each sequence we accumulate the sums of the values for the list of codons (the two elements of the tuple, S and N, are summed separately). Then we average those sums over the two sequences that we're comparing. The two averages will be the denominators of the final ratios for S and N.

The second part of the calculation involves comparisons between two codons. There are four cases, corresponding to a Hamming distance of 0, 1, 2, or 3.

0:
(The same codon), so there is nothing to do, as we just ignore these pairs.

1:
Also simple, we record the tuple (1.0, 0.0) if they are synonomous and (0.0, 1.0) if not.

2:
For this case, there are two positions which need to be changed to convert one codon into the other (at least in the simplest pathway, which is all that we look at). We consider each of the single changes as candidate intermediates. This results in 2 possible paths, with 2 changes for each: C1 => I1 => C2, and C1 => I2 => C2. For each of the 4 single changes, then, we record (0.5, 0.0) if they are synonymous and (0.0, 0.5) otherwise. We sum the tuples for the final value. For example, for the pair CTT CAA, there are two paths (asterisks mark the base that just changed):

        *       *               *     *
CTT => CAT => CAA      CTT => CTA => CAA
Leu    His    Gln      Leu    Leu    Gln

One of the four changes (CTT => CTA) is synonymous, while the other 3 are not. We record the final value (0.5, 1.5).

3:
This is the more complex case. Here there 3 differences and hence 6 possible paths. For the pair GGC and CTT, the paths are:

       *       *       *             *        *     *    
GGC => CGC => CTC => CTT      GGC => CGC => CGT => CTT
Gly    Arg    Leu    Leu      Gly    Arg    Arg    Leu

        *     *        *              *       *    *
GGC => GTC => CTC => CTT      GGC => GTC => GTT => CTT
Gly    Val    Leu    Leu      Gly    Val    Val    Leu

         *    *       *                *     *     *
GGC => GGT => CGT => CTT      GGC => GGT => GTT => CTT
Gly    Gly    Arg    Leu      Gly    Gly    Val    Leu

Of the 18 changes, 6 are synonymous. We record the final value (3.0, 6.0). The debug output shows the program processing this pair:

next pair: GGC CTT
d = 3
test: CGC
process: d = 1 GGC CGC
value = (0.0, 1.0)
process: d = 2 CGC CTT
test intermediates:
CTC syn CTT
count 1
CGT syn CGC count 2
value = (1.0, 1.0)
test: GTC
process: d = 1 GGC GTC
value = (0.0, 1.0)
process: d = 2 GTC CTT
test intermediates:
CTC syn CTT
count 1
GTT syn GTC count 2
value = (1.0, 1.0)
test: GGT
process: d = 1 GGC GGT
value = (1.0, 0.0)
process: d = 2 GGT CTT
test intermediates:
GTT count 0
CGT count 0
value = (0.0, 2.0)
3.0 6.0

The values for each pair are saved in a second dict. We'll exercise that next time. The zipped project files are on Dropbox (here).