## Tuesday, March 8, 2011

### 16S rRNA V regions

I'm exploring ways to visualize the sequence diversity in 16S rRNA. I finally realized that to see the "V" regions clearly, I need an alignment in which the V regions can align, that is, the sequences need to be closely enough related. To get such a set, I went to RDP (Browsers), and grabbed 187 type sequences from the Order Enterobacteriales. In the next post, I'm actually going to generate the graphic above (it's a tease). The figure shows the information content of the sequence set computed for a window sliding across the length of the gene. The troughs are V regions.

But first, I thought I would post some utility code separately. Here is a script that imports the utility functions shown in the last part, below, and exercises them on the enteric sequences:

 `import utils as utdata = ut.load_data('entero.txt')data = data.strip().split('>')[1:]seqs = [ut.clean_fasta(e)[1] for e in data]L = ut.make_count_list(seqs,kL='ACGT')for D in L[75:80]: print D`

Here is what it prints:

 `> python rdp.py{'A': 174, 'C': 2, 'T': 3, 'G': 1}{'A': 51, 'C': 2, 'T': 3, 'G': 123}{'A': 3, 'C': 172, 'T': 2, 'G': 3}{'A': 127, 'C': 5, 'T': 1, 'G': 47}{'A': 1, 'C': 76, 'T': 3, 'G': 99}`

The first module contains a function to return the Shannon entropy for a distribution:

`info.py`

 `from math import logdef f_logf(f): if f == 0: return 0 return f*log(f)*1.0/log(2)def shannon(cD,kL=None): if not kL: kL = cD.keys() L = [cD[k] for k in kL] S = sum(L) if S == 0: raise ValueError('Empty list') L = [n*1.0/S for n in L] L = [f_logf(f) for f in L] return 2 + sum(L)`

The second module contains some functions that I use all the time. I put this in my `site-packages` directory:

`utils.py`

 `def load_data(fn): FH = open(fn,'r') data = FH.read().strip() if '\r\n' in data: data = s.replace('\r\n', '\n') FH.close() return data def reverse_complement(seq): import string.maketrans tt = maketrans('ACGT','TGCA') return seq[::-1].translate(tt)def write_data(fn,s): FH = open(fn,'w') FH.write(s) FH.close() def clean_fasta(s): title, seq = s.strip().split('\n',1) seq = ''.join(seq.strip().split()) return title, seq def make_count_list(L,kL=None,upper=True): # L is a list of strings (seqs) # each str the same length # count all chars assert len(set([len(s) for s in L])) == 1 iL = L[:] if upper: iL = [e.upper() for e in iL] if not kL: kL = sorted(list(set(''.join(iL)))) rL = list() R = range(len(iL[0])) for i in R: temp = [e[i] for e in iL] counts = [temp.count(k) for k in kL] rL.append(dict(zip(kL,counts))) return rL`