Tuesday, March 8, 2011

16S rRNA V regions, continued

As promised last time (here), this is the code to plot V regions from the sequences in the Enterobacteriales, obtained from RDP. If you compare to this post from the other day, you'll see we match pretty well.

The figure looks better with a wider window, but the limits are established more accurately with a smaller window. The threshold T needs to be adjusted based on the window size.

One detail: I did the redirect below to save the extreme values and then parsed them with the code at the end of the post.

python script.py > extreme_values.txt


import utils as ut
from info import shannon
import matplotlib.pyplot as plt

data = ut.load_data('entero.txt')
data = data.strip().split('>')[1:]
EC = [e for e in data if 'X80725' in e][0]
EC = ut.clean_fasta(EC)[1]
data = [ut.clean_fasta(e)[1] for e in data]
L = ut.make_count_list(data)

pos = 0
R = range(1,1451)
iL = list()

for i,c in enumerate(EC):
if c == '-': continue
pos += 1
if not pos in R: continue
cD = L[i]
e = shannon(cD,'ACGT')
#print str(pos).ljust(3), c + ' ',
#print ''.join([str(cD[k]).ljust(5) for k in 'ACGT']),
#print round(e,2)

aL = list()
w = 20
T = 1.8
for i in range(len(iL)):
j, k = i - w, i + w + 1
if j < 0: j = 0
if k > len(iL): k = len(iL)
m = ut.mean(iL[j:k])
if m < T: print i+1

ax = plt.axes()


import utils as ut
data = ut.load_data('extreme_values.txt')
L = [int(n) for n in data.strip().split('\n')]

current = L.pop(0)
print current, '-',
while L:
next = L.pop(0)
if next != current + 1:
print current
print next, '-',
current = next
print next