Tuesday, March 8, 2011

16S rRNA V regions


I'm exploring ways to visualize the sequence diversity in 16S rRNA. I finally realized that to see the "V" regions clearly, I need an alignment in which the V regions can align, that is, the sequences need to be closely enough related. To get such a set, I went to RDP (Browsers), and grabbed 187 type sequences from the Order Enterobacteriales. In the next post, I'm actually going to generate the graphic above (it's a tease). The figure shows the information content of the sequence set computed for a window sliding across the length of the gene. The troughs are V regions.

But first, I thought I would post some utility code separately. Here is a script that imports the utility functions shown in the last part, below, and exercises them on the enteric sequences:


import utils as ut

data = ut.load_data('entero.txt')
data = data.strip().split('>')[1:]
seqs = [ut.clean_fasta(e)[1] for e in data]
L = ut.make_count_list(seqs,kL='ACGT')
for D in L[75:80]:
print D


Here is what it prints:


> python rdp.py
{'A': 174, 'C': 2, 'T': 3, 'G': 1}
{'A': 51, 'C': 2, 'T': 3, 'G': 123}
{'A': 3, 'C': 172, 'T': 2, 'G': 3}
{'A': 127, 'C': 5, 'T': 1, 'G': 47}
{'A': 1, 'C': 76, 'T': 3, 'G': 99}


The first module contains a function to return the Shannon entropy for a distribution:

info.py


from math import log
def f_logf(f):
if f == 0: return 0
return f*log(f)*1.0/log(2)

def shannon(cD,kL=None):
if not kL:
kL = cD.keys()
L = [cD[k] for k in kL]
S = sum(L)
if S == 0:
raise ValueError('Empty list')
L = [n*1.0/S for n in L]
L = [f_logf(f) for f in L]
return 2 + sum(L)


The second module contains some functions that I use all the time. I put this in my site-packages directory:

utils.py


def load_data(fn):
FH = open(fn,'r')
data = FH.read().strip()
if '\r\n' in data:
data = s.replace('\r\n', '\n')
FH.close()
return data

def reverse_complement(seq):
import string.maketrans
tt = maketrans('ACGT','TGCA')
return seq[::-1].translate(tt)

def write_data(fn,s):
FH = open(fn,'w')
FH.write(s)
FH.close()

def clean_fasta(s):
title, seq = s.strip().split('\n',1)
seq = ''.join(seq.strip().split())
return title, seq

def make_count_list(L,kL=None,upper=True):
# L is a list of strings (seqs)
# each str the same length
# count all chars
assert len(set([len(s) for s in L])) == 1
iL = L[:]
if upper:
iL = [e.upper() for e in iL]
if not kL:
kL = sorted(list(set(''.join(iL))))
rL = list()
R = range(len(iL[0]))
for i in R:
temp = [e[i] for e in iL]
counts = [temp.count(k) for k in kL]
rL.append(dict(zip(kL,counts)))
return rL