Python for Bioinformatics: Deonier Ch 2 Codon Bias

Sunday, August 10, 2008

Deonier Ch 2 Codon Bias - Codon Data

Continuing with the problem of analyzing codon bias: there is abundant evidence that the preferred codon usage pattern varies greatly, and there are several ideas about why this might be (see e.g. Andersson and Kurland 1990). What I want to explore here is whether there is a relationship between expression level and codon usage in E. coli. Please note that this is not a scholarly examination of the question; I am really just trying to develop my scripting skills by reading and exploring Deonier.

In previous posts (1, 2) I discussed obtaining the data for all coding sequences (CDS) from the GenBank record for the genome of MG1655, and Affymetrix array expression data (for a rather distant relative, AB1157). Loading and sorting the data is straightforward (see my script). I obtained expression data for 4345 genes, but only one of the top 50 in expression encodes a protein (lpp). Filtering the expression data for CDSs leaves 2938 items. Here is a histogram of the values, scaled so that we can look at rare bins on the high end of the distribution.

I chose genes with expression levels in the top 40 as representative of highly expressed genes (values to the right of the red vertical bar, > 276), and I chose genes with expression < 150 as representative of genes with average expression. Here are the top 12:

lpp   3343.85
rmf   1953.51
fimA  1112.27
cspC   853.39
yfiD   840.27
rpmJ   759.78
yiiU   635.30
cspG   597.44
hns    570.08
cspE   549.95
udp    537.76
dnaK   470.56
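
The selection step described above can be sketched roughly as follows. This is not my original script, just a minimal illustration: the function name split_by_expression and its parameters n_top and average_cutoff are made up for this example, and the input is assumed to be a dict mapping gene name to expression value (already filtered for CDSs).

```python
def split_by_expression(expression, n_top=40, average_cutoff=150.0):
    """Split genes into a 'high' group and an 'average' group.

    expression: dict of gene name -> expression value (assumed input format).
    n_top: how many of the most highly expressed genes go in the high group.
    average_cutoff: genes below this value go in the average group.
    """
    # Sort gene/value pairs from highest to lowest expression.
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    high = [gene for gene, _ in ranked[:n_top]]
    # Genes between the two thresholds belong to neither group.
    average = [gene for gene, value in ranked[n_top:] if value < average_cutoff]
    return high, average


# Demo with the 12 values listed above. The thresholds here (top 5, < 600)
# are chosen only so the tiny example produces two non-empty groups.
top12 = {
    "lpp": 3343.85, "rmf": 1953.51, "fimA": 1112.27, "cspC": 853.39,
    "yfiD": 840.27, "rpmJ": 759.78, "yiiU": 635.30, "cspG": 597.44,
    "hns": 570.08, "cspE": 549.95, "udp": 537.76, "dnaK": 470.56,
}
high, average = split_by_expression(top12, n_top=5, average_cutoff=600.0)
print(high)
print(average)
```

With the defaults (n_top=40, average_cutoff=150.0) the function mirrors the thresholds used in this post when run on the full set of expression values.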

I also went back and looked at one of the original references (Sharp and Li, 1986, behind a paywall), which gives lists of very highly and highly expressed genes. Interestingly, only 5 of the genes in their very highest category make it into the top 40 here.

From this point, it is simply a matter of counting all the codons used in each gene for each group and saving the counts in a dict. To analyze the results, for each amino acid and each group, we compute the ratio of the count for a given codon to the total count of its synonymous codons (its frequency), and finally, we compute the ratio of ratios (high expression to average). Here are examples for two amino acids, serine and tyrosine, as well as stop codons:

S  TCG       22  0.059    8457  0.156  0.382
S  TCA       21  0.057    6449  0.119  0.478
S  AGT       37  0.100    7923  0.146  0.685
S  AGC       82  0.222   15247  0.281  0.789
S  TCC       91  0.246    8282  0.153  1.612
S  TCT      117  0.316    7909  0.146  2.170

Y  TAT       75  0.362   15155  0.561  0.646
Y  TAC      132  0.638   11852  0.439  1.453

*  TGA        8  0.200     796  0.282  0.710
*  TAG        2  0.050     198  0.070  0.714
*  TAA       30  0.750    1832  0.648  1.157

The first two columns are the amino acid and the codon. The third and fourth columns are the codon count and frequency for highly expressed genes, the fifth and sixth columns are the same for average genes, and the last column is the ratio of ratios. It's clear that some codons are disfavored in highly expressed genes. For the stop codons, there are not enough examples to say anything with confidence.
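
For anyone who wants to follow along, here is a minimal sketch of the counting and the ratio-of-ratios calculation. It is not my original script. The function names (count_codons, synonymous_freqs, ratio_of_ratios) are made up for this example, the sequences are assumed to be plain uppercase DNA strings in frame, and the standard genetic code is built by hand rather than taken from a library.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(triple): aa
    for triple, aa in zip(product(BASES, repeat=3), AMINO)
}


def count_codons(sequences):
    """Count codons over a list of in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        # Step through the sequence three bases at a time,
        # ignoring any incomplete codon at the end.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TO_AA:  # skip codons with ambiguous bases
                counts[codon] += 1
    return counts


def synonymous_freqs(counts):
    """Frequency of each codon among the codons for the same amino acid."""
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TO_AA[codon]] += n
    return {codon: n / totals[CODON_TO_AA[codon]]
            for codon, n in counts.items() if n > 0}


def ratio_of_ratios(high_counts, avg_counts):
    """Frequency in the high group divided by frequency in the average group."""
    high = synonymous_freqs(high_counts)
    avg = synonymous_freqs(avg_counts)
    # Only codons seen in both groups get a ratio.
    return {codon: high[codon] / avg[codon] for codon in high if codon in avg}


# Serine counts copied from the table above (high group, average group).
ser_high = Counter(TCG=22, TCA=21, AGT=37, AGC=82, TCC=91, TCT=117)
ser_avg = Counter(TCG=8457, TCA=6449, AGT=7923, AGC=15247, TCC=8282, TCT=7909)
ser_ratios = ratio_of_ratios(ser_high, ser_avg)
for codon in sorted(ser_ratios, key=ser_ratios.get):
    print(codon, round(ser_ratios[codon], 3))
```

Run on the serine counts, this reproduces the last column of the serine rows. In real use, high_counts and avg_counts would come from count_codons applied to the CDS sequences of each group.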

Here are some more, where we compare the most and least favored codon for a particular amino acid:

A  GCT      250  0.337   14131  0.158  2.138
A  GCC      116  0.156   24186  0.270  0.580

C  TGC       37  0.627    6080  0.564  1.111
C  TGT       22  0.373    4693  0.436  0.856

D  GAC      234  0.505   18400  0.378  1.338
D  GAT      229  0.495   30307  0.622  0.795

E  GAA      408  0.730   37809  0.691  1.056
E  GAG      151  0.270   16878  0.309  0.875

F  TTC      163  0.639   16091  0.435  1.471
F  TTT       92  0.361   20942  0.565  0.638

G  GGT      300  0.503   23714  0.339  1.483
G  GGA       23  0.039    7113  0.102  0.379

H  CAC       80  0.552    9403  0.438  1.260
H  CAT       65  0.448   12073  0.562  0.797

I  ATC      275  0.608   24210  0.426  1.428
I  ATA       10  0.022    3708  0.065  0.339

K  AAA      384  0.814   31585  0.768  1.059
K  AAG       88  0.186    9525  0.232  0.805

L  CTG      437  0.714   51779  0.509  1.402
L  CTA       10  0.016    3588  0.035  0.463

M  ATG      184  1.000   26359  1.000  1.000

N  AAC      258  0.789   20608  0.560  1.409
N  AAT       69  0.211   16185  0.440  0.480

P  CCG      164  0.672   22815  0.542  1.240
P  CCC        7  0.029    4951  0.118  0.244

Q  CAG      272  0.791   27460  0.658  1.201
Q  CAA       72  0.209   14248  0.342  0.613

R  CGT      234  0.578   20306  0.390  1.480
R  CGG       11  0.027    4727  0.091  0.299

S  TCT      117  0.316    7909  0.146  2.170
S  TCG       22  0.059    8457  0.156  0.382

T  ACT      152  0.328    8277  0.164  2.001
T  ACG       57  0.123   13497  0.267  0.460

V  GTT      261  0.441   17052  0.254  1.735
V  GTC       66  0.111   14442  0.215  0.518