Python for Bioinformatics: Over-represented long oligonucleotides

Friday, August 8, 2008

Over-represented long oligonucleotides

We can look at the frequencies of longer oligos in the genome using Python. In this example, I look at the genome of Haemophilus influenzae because I know there is something interesting. The sequence is from Genbank L42023. On the average, we'd expect an individual 12-mer oligo (in a 50% GC genome) to be present once in 1.7 Mbp (4**12).

seq = open('Hinf.genome.txt','r').read().strip()
title, seq = seq.split('\n',1)
print title
seq = ''.join(seq.split())

SZ = 12
D = dict()
for i in range(len(seq) - SZ):
    o = seq[i:i+SZ]
    if D.has_key(o):
        D[o] += 1
    else:
        D[o] = 1

def f(k):
    return D[k]
    
for k in sorted(D.keys(),key=f)[-20:]:
    print k, D[k]
    
FH = open('results.txt','w')
FH.write('\n'.join([str(D[k]) for k in D.keys()]))

It prints a list of oligos which are all related sequences except for the first one.


>gi|6626252|gb|L42023.1| Haemophilus influenzae Rd KW20, complete genome
TTGGTTGGTTGG 72
AAAAGTGCGGTA 76
TTAACCGCACTT 77
TAAAAGTGCGGT 79
AAAAGTGCGGTC 82
TTTACCGCACTT 82
AAAGTGCGGTAA 83
ACCGCACTTTTA 83
AAAAGTGCGGTT 85
TTGACCGCACTT 87
AAGTGCGGTTAA 93
AACCGCACTTTT 95
TTACCGCACTTT 99
AAAGTGCGGTTA 102
TAACCGCACTTT 108
TGACCGCACTTT 110
AAGTGCGGTCAA 115
AAAAAGTGCGGT 120
ACCGCACTTTTT 122
AAAGTGCGGTCA 141

We plot the results using R

setwd('Desktop')
v = read.table('results.txt',head=F)[,1]
mean(v)
median(v)
par('mar' = par('mar')+2)
par(cex.lab = 2)
hist(v,col='blue',breaks = 150,
    xlim=c(0,180),ylim=c(0,20),cex.lab=2)