Friday, August 8, 2008

Dinucleotide frequencies

On p. 48-50 Deonier outline an analysis of dinucleotide frequencies. They use (what I think is) a modification of the Chi-square statistic, where the standard square of (O - E) divided by E is also divided by a normalizing constant they call c. This is some kind of correction for the fact that nucleotide frequencies aren't equal. I'm not going to post my code here, but I'll put a copy of the file up on the .mac server and it'll last as long as I renew my subscription. I really wish there was a way to post text files to Blogger. The first 1000 bp of genomic sequence for E. coli and M. genitalium were obtained from Genbank (NC_000913 and L43967). What took me more time than anything was to re-discover that string's count method is not suitable for this sort of thing. That's because of this behavior:

>>> s = 'GAAATTC'
>>> s.count('AA')
1
>>> s = 'GAAAAAAAAAAAAAAAAAAAAAATTC'
>>> s.count('AA')
11

This is what I get using the nucleotide frequencies assumed by Deonier. The first block is for E. coli. The second is for M. genitalium.

The columns are: dinucleotide, observed count, expected count, Chi-square, c, Chi-square/c. Note that my values are close but not quite the same as in the book. Also there is (not shown) quite a bit of difference when the true nucleotide frequencies are used, even though they aren't very different than those assumed for this analysis. The values in the last column > 1 indicate a discrepancy between the value expected based on single nucleotide frequency and that observed in the sequence.
AA      86  62.5000   8.8360   1.3125   6.7322
AC 64 62.5000 0.0360 0.8125 0.0443
AG 45 62.5000 4.9000 0.8125 6.0308
AT 63 62.5000 0.0040 0.8125 0.0049
CA 74 62.5000 2.1160 0.8125 2.6043
CC 64 62.5000 0.0360 1.3125 0.0274
CG 69 62.5000 0.6760 0.8125 0.8320
CT 47 62.5000 3.8440 0.8125 4.7311
GA 52 62.5000 1.7640 0.8125 2.1711
GC 85 62.5000 8.1000 0.8125 9.9692
GG 63 62.5000 0.0040 1.3125 0.0030
GT 53 62.5000 1.4440 0.8125 1.7772
TA 45 62.5000 4.9000 0.8125 6.0308
TC 41 62.5000 7.3960 0.8125 9.1028
TG 76 62.5000 2.9160 0.8125 3.5889
TT 72 62.5000 1.4440 1.3125 1.1002

AA 197 202.5000 0.1494 1.2925 0.1156
AC 47 40.5000 1.0432 0.8785 1.1875
AG 43 40.5000 0.1543 0.8785 0.1757
AT 167 166.5000 0.0015 0.5005 0.0030
CA 42 40.5000 0.0556 0.8785 0.0632
CC 10 8.1000 0.4457 1.1557 0.3856
CG 2 8.1000 4.5938 0.9757 4.7082
CT 39 33.3000 0.9757 0.9001 1.0840
GA 37 40.5000 0.3025 0.8785 0.3443
GC 11 8.1000 1.0383 0.9757 1.0641
GG 7 8.1000 0.1494 1.1557 0.1293
GT 29 33.3000 0.5553 0.9001 0.6169
TA 179 166.5000 0.9384 0.5005 1.8750
TC 25 33.3000 2.0688 0.9001 2.2984
TG 32 33.3000 0.0508 0.9001 0.0564
TT 132 136.9000 0.1754 1.3293 0.1319