Python for Bioinformatics: Deonier Ch 2 frequency bias

Sunday, August 10, 2008

Deonier Ch 2 frequency bias

My last example for Ch 2 is inspired by exercise 11. We load the E. coli genome sequence, and determine the expected count of 'ACGT' per 1 kb, based on the overall nucleotide composition. Then we slide a window of 1 kb along the genome and compare the observed count to the expected, computing the chiSquared statistic like this:

def chiSquared(s, v=False):
    if len(s) != 1000:  
        print 'error, wrong length'
        return
    total = 0
    for c in b:
        o = s.count(c)
        e = eD[c]
        statistic = 1.0*(o-e)**2/e
        total += statistic
        if v:  print c, o, e, 
        if v:  print '%3.2f' % statistic,
    if v:  print '%3.2f' % total
    return total

We plot the results using R. Notice that segments with high bias rise above the noise, and these segments tend to cluster, as marked by the vertical red line:

If we take a closer look, we resolve particular segments involved.

This region around 3796 - 3800 kb is involved in LPS biosynthesis. This is a screenshot from my GeneFinder app.

Here is a closeup of Fig 1 in Lawrence and Ochman 1998.

And here is part of an abstract of a paper noting the lack of conservation between E. coli and Salmonella. As far as I can tell, the functions of the the genes with very high bias aren't known, but presumably their lack of conservation is related to evolutionary changes in LPS.

If we set the verbose flag, we can go back to this segment and look at the values:

3794 A 263 C 225 G 278 T 234   7.34
3795 A 326 C 162 G 149 T 363 158.39
3796 A 321 C 159 G 173 T 347 125.70
3797 A 345 C 158 G 151 T 346 158.54
3798 A 329 C 180 G 160 T 331 113.72
3799 A 332 C 181 G 153 T 334 122.69
3800 A 314 C 196 G 171 T 319  80.83
3801 A 310 C 210 G 180 T 300  57.69
3802 A 335 C 176 G 122 T 367 184.27
3803 A 282 C 220 G 175 T 323  58.49
3804 A 271 C 237 G 213 T 279  14.72
3805 A 276 C 247 G 205 T 272  16.05
3806 A 293 C 231 G 210 T 266  20.31
3807 A 217 C 273 G 266 T 244   5.42

What would be nice at this point would be to filter the output for regions with extreme values and then recover and order all the ORFs in each. I'll leave that as an exercise for the reader. :)