In previous posts (1, 2) I discussed obtaining the data for all coding sequences (CDS) from the GenBank record for the genome of MG1655, and Affymetrix array expression data (for a rather distant relative, AB1157). Loading and sorting the data is straightforward (see my script). I obtained expression data for 4345 genes, but only one of the top 50 in expression (lpp) encodes a protein. Filtering the expression data for CDSs leaves 2938 items. Here is a histogram of the values, scaled so that we can look at rare bins on the high end of the distribution.
I chose genes with expression levels in the top 40 as representative of highly expressed genes (values to the right of the red vertical bar, > 276), and I chose genes with expression < 150 as representative of genes with average expression. Here are the top 12:
I also went back and looked at one of the original references (Sharp and Li, 1986, behind a paywall), which gives lists of very highly and highly expressed genes. Interestingly, only 5 of the genes in their very highly expressed category fall in my top 40.
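The grouping step described above (the top 40 genes versus genes with expression below 150) could look roughly like the following. This is a minimal sketch, not the script linked earlier: it assumes the filtered expression values sit in a dict keyed by gene name, and the function name `split_by_expression` and the toy values are made up for illustration.

```python
def split_by_expression(expr, n_top=40, avg_cutoff=150.0):
    """Return (high, average) lists of gene names.

    expr is assumed to map gene name -> expression value
    (a hypothetical layout, not necessarily what the original script uses).
    """
    # Rank genes from highest to lowest expression.
    ranked = sorted(expr, key=expr.get, reverse=True)
    high = ranked[:n_top]
    high_set = set(high)
    # "Average" genes fall below the cutoff and are not already in the top group.
    average = [g for g in ranked if expr[g] < avg_cutoff and g not in high_set]
    return high, average


# Toy values for illustration only, not measured data.
toy = {"geneA": 900.0, "geneB": 300.0, "geneC": 140.0, "geneD": 80.0, "geneE": 200.0}
high, average = split_by_expression(toy, n_top=2, avg_cutoff=150.0)
print(high, average)
```

Genes between the two thresholds (here `geneE`) land in neither group, which matches the gap between 150 and 276 in the cutoffs described above.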
From this point, it is simply a matter of counting all the codons used in each gene for each group and saving the counts in a dict. To analyze the results, for each amino acid, we compute the ratio of the count for a given codon to the total count of its synonymous codons, and finally, we compute the ratio of ratios (high expression to average). Here are examples for two amino acids, serine and tyrosine, as well as stop codons:
S TCG 22 0.059 8457 0.156 0.382
The third and fourth columns are the codon count and frequency for highly expressed genes, the fifth and sixth columns are the same for average genes, and the last column is the ratio of ratios. It's clear that some codons are disfavored in highly expressed genes. For the stop codons, there are not enough examples to say anything with confidence.
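To make the columns concrete, here is a small sketch of the counting and the ratio-of-ratios calculation. It is an illustration under stated assumptions rather than the original script: the function names are invented, the codon table covers only serine and tyrosine, sequences are assumed to be in-frame DNA strings, and the toy sequences are made up.

```python
from collections import Counter

# Partial table of synonymous codons, for illustration only.
SYNONYMS = {
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Y": ["TAT", "TAC"],
}


def count_codons(seqs):
    """Count codons over a list of in-frame coding sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # Step through the sequence three bases at a time,
        # ignoring any trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def codon_fractions(counts, codons):
    """Fraction of each codon among its synonymous codons."""
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}


def ratio_of_ratios(high_counts, avg_counts, codons):
    """Fraction in the high group divided by fraction in the average group."""
    high = codon_fractions(high_counts, codons)
    avg = codon_fractions(avg_counts, codons)
    return {c: (high[c] / avg[c] if avg[c] else float("nan")) for c in codons}


# Toy sequences, not real genes.
high_counts = count_codons(["TATTACTACTAC"])  # TAT x1, TAC x3
avg_counts = count_codons(["TATTATTACTAC"])   # TAT x2, TAC x2
rr = ratio_of_ratios(high_counts, avg_counts, SYNONYMS["Y"])
print(rr)
```

In the toy case TAT makes up 0.25 of the tyrosine codons in the high group and 0.5 in the average group, so its ratio of ratios is 0.5. The TCG row above works the same way: 0.059 / 0.156 is about 0.38, with the small difference from the printed value coming from rounding of the displayed frequencies.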
Here are some more, where we compare the most and least favored codon for a particular amino acid:
A GCT 250 0.337 14131 0.158 2.138