We can look at the frequencies of longer oligos in the genome using Python. In this example, I look at the genome of Haemophilus influenzae because I know there is something interesting there. The sequence is from GenBank, accession L42023. On average, we'd expect an individual 12-mer oligo (in a 50% GC genome) to be present once in about 16.8 Mbp (4**12 = 16,777,216), so in a genome of under 2 Mbp most 12-mers should be rare or absent.
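The expectation above is simple arithmetic, and a short sketch makes it concrete. The genome length used here is an assumed round figure (about 1.83 Mbp) rather than a value taken from the sequence file, and the model assumes all four bases are equally likely.

```python
# Expected count of one specific 12-mer under a uniform base model
# (each base has probability 1/4, i.e. a 50% GC genome).
k = 12
n_kmers = 4 ** k                 # number of distinct 12-mers
genome_len = 1_830_000           # ASSUMPTION: approximate genome length in bp
windows = genome_len - k + 1     # overlapping 12-mer positions on one strand
expected = windows / n_kmers     # expected occurrences of a given 12-mer

print(n_kmers)
print(round(expected, 3))
```

Under these assumptions a given 12-mer is expected well under once per strand, so any 12-mer that turns up hundreds of times stands out.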
seq = open('Hinf.genome.txt','r').read().strip()
The script prints a list of oligos; all of them except the first are related sequences.
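Only the first line of the script survives here, so the following is a minimal sketch of one way to count oligos, not the original code. The helper names `count_kmers` and `top_kmers` are hypothetical, the toy sequence is made up for illustration, and only one strand is counted.

```python
from collections import Counter

def count_kmers(seq, k=12):
    """Count every overlapping k-mer in seq (one strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def top_kmers(seq, k=12, n=10):
    """Return the n most frequent k-mers as (oligo, count) pairs."""
    return count_kmers(seq, k).most_common(n)

# Toy input so the sketch runs without the genome file.
toy = "GATTACAGA" * 3 + "CCGT"
for oligo, count in top_kmers(toy, k=9, n=3):
    print(oligo, count)

# With the real data (assuming the file holds raw sequence, no FASTA header):
# seq = open('Hinf.genome.txt', 'r').read().strip().replace('\n', '')
# for oligo, count in top_kmers(seq, k=12, n=10):
#     print(oligo, count)
```

A `Counter` keeps the whole table in memory, which should be fine for a genome of this size; a much larger genome might need a different approach.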
We plot the results using R:
setwd('Desktop')