Python for Bioinformatics: 16S rRNA V regions

Tuesday, March 8, 2011

16S rRNA V regions

I'm exploring ways to visualize the sequence diversity in 16S rRNA. I finally realized that to see the "V" (variable) regions clearly, I need an alignment in which the V regions themselves can be aligned; that is, the sequences must be closely enough related. To get such a set, I went to RDP (Browsers) and grabbed 187 type sequences from the order Enterobacteriales. In the next post, I'm actually going to generate the graphic above (it's a tease). The figure shows the information content of the sequence set, computed for a window sliding across the length of the gene. The troughs are the V regions.
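The sliding-window idea can be sketched in a few lines. This is only an illustration, not the code behind the figure: the function name `window_means`, the window width, and the toy per-column values are all assumptions made up for the example.

```python
def window_means(values, width):
    """Mean of each run of `width` consecutive values (hypothetical helper)."""
    if width < 1 or width > len(values):
        raise ValueError('width must be between 1 and len(values)')
    means = []
    for i in range(len(values) - width + 1):
        window = values[i:i + width]
        means.append(sum(window) * 1.0 / width)
    return means

# toy per-column information content, in bits (made-up numbers)
info = [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 2.0]
print(window_means(info, 3))
```

A wider window gives a smoother curve but blurs the edges of each trough, so the width is something to tune by eye.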

But first, I thought I would post some utility code separately. Here is a script that imports the utility functions listed at the end of this post and exercises them on the enteric sequences:

import utils as ut

data = ut.load_data('entero.txt')
data = data.strip().split('>')[1:]
seqs = [ut.clean_fasta(e)[1] for e in data]
L = ut.make_count_list(seqs,kL='ACGT')
for D in L[75:80]:
   print(D)

Here is what it prints:

> python rdp.py
{'A': 174, 'C': 2, 'T': 3, 'G': 1}
{'A': 51, 'C': 2, 'T': 3, 'G': 123}
{'A': 3, 'C': 172, 'T': 2, 'G': 3}
{'A': 127, 'C': 5, 'T': 1, 'G': 47}
{'A': 1, 'C': 76, 'T': 3, 'G': 99}

The first module contains a function that returns the information content of a distribution, that is, 2 bits (the maximum for four nucleotides) minus the Shannon entropy:

info.py

from math import log
def f_logf(f):  
    if f == 0:  return 0
    return f*log(f)*1.0/log(2)

def shannon(cD,kL=None):
    if not kL:
        kL = cD.keys()
    L = [cD[k] for k in kL]
    S = sum(L)
    if S == 0:
        raise ValueError('Empty list')
    L = [n*1.0/S for n in L]
    L = [f_logf(f) for f in L]
    return 2 + sum(L)
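
As a worked example, here is the same calculation applied to the first column printed above. This is a compact restatement rather than the module itself: `info_content` is a name chosen for the sketch, and it assumes a four-letter DNA alphabet, which is where the maximum of 2 bits comes from.

```python
from math import log

def info_content(counts):
    """2 bits minus the Shannon entropy of a dict of base counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError('Empty list')
    entropy = 0.0
    for n in counts.values():
        if n:                       # treat 0*log(0) as 0
            f = n * 1.0 / total
            entropy -= f * log(f, 2)
    return 2 - entropy

conserved = {'A': 174, 'C': 2, 'T': 3, 'G': 1}   # first column printed above
uniform = {'A': 45, 'C': 45, 'T': 45, 'G': 45}   # made-up worst case
print(info_content(conserved))   # high: the column is well conserved
print(info_content(uniform))     # essentially zero: no information
```

A column with a single base scores the full 2 bits, and a column with all four bases equally frequent scores 0, so the V regions show up as low values.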

The second module contains some functions that I use all the time. I put this in my site-packages directory:

utils.py

def load_data(fn):
   FH = open(fn,'r')
   data = FH.read().strip()
   if '\r\n' in data: 
       data = data.replace('\r\n', '\n')
   FH.close()
   return data
  
def reverse_complement(seq):
   # Python 2 only; Python 3 provides str.maketrans instead
   from string import maketrans
   tt = maketrans('ACGT','TGCA')
   return seq[::-1].translate(tt)
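
The `string.maketrans` function exists only in Python 2. A minimal sketch of the same idea for Python 3, assuming the sequence is uppercase and contains only A, C, G and T, uses `str.maketrans` instead:

```python
# translation table built once; maps each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Reverse the sequence, then complement each base."""
    return seq[::-1].translate(COMPLEMENT)

print(reverse_complement('AACGT'))
```

Characters that are not in the table (gaps, N, lowercase letters) pass through unchanged, so real data may need an extra cleaning step first.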

def write_data(fn,s):
   FH = open(fn,'w')
   FH.write(s)
   FH.close()
  
def clean_fasta(s):
   title, seq = s.strip().split('\n',1)
   seq = ''.join(seq.strip().split())
   return title, seq
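
To make the parsing concrete, here is the function applied to a small made-up FASTA string, split on '>' the same way the script above does. The record names and sequences are invented for the example.

```python
def clean_fasta(s):
    # first line is the title; the rest is sequence, possibly wrapped
    title, seq = s.strip().split('\n', 1)
    seq = ''.join(seq.strip().split())
    return title, seq

# two invented records; the second sequence contains a space and a line break
data = '>seq1 example\nACGT\nAC\n>seq2 example\nGG TT\nAA\n'
records = data.strip().split('>')[1:]    # drop the empty string before the first '>'
for title, seq in [clean_fasta(r) for r in records]:
    print(title + ': ' + seq)
```

Note that a record with a title line but no sequence would fail at the `split('\n', 1)` unpacking, so this assumes every record has at least one sequence line.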
  
def make_count_list(L,kL=None,upper=True):
   # L is a list of strings (seqs)
   # each str the same length
   # count all chars
   assert len(set([len(s) for s in L])) == 1
   iL = L[:]
   if upper:
       iL = [e.upper() for e in iL]
   if not kL:
       kL = sorted(list(set(''.join(iL))))
   rL = list()
   R = range(len(iL[0]))
   for i in R:
       temp = [e[i] for e in iL]
       counts = [temp.count(k) for k in kL]
       rL.append(dict(zip(kL,counts)))
   return rL
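
For comparison, the same per-column tally can be written with `zip` and `collections.Counter`. This is an alternative sketch, not the function above: `column_counts` is a name made up here, and it assumes equal-length, already-uppercased sequences.

```python
from collections import Counter

def column_counts(seqs, alphabet='ACGT'):
    """One dict of counts per alignment column (hypothetical helper)."""
    if len(set(len(s) for s in seqs)) != 1:
        raise ValueError('sequences must all be the same length')
    result = []
    for column in zip(*seqs):            # zip(*seqs) yields one tuple per column
        c = Counter(column)
        # characters outside `alphabet` (gaps, N) are ignored, as with kL='ACGT'
        result.append(dict((k, c[k]) for k in alphabet))
    return result

seqs = ['ACGT', 'ACGA', 'AC-T']          # toy alignment
for d in column_counts(seqs):
    print(d)
```

Because gap characters are skipped, the counts in a column can add up to less than the number of sequences, which is also why the totals in the output printed earlier fall short of 187.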