I'm exploring ways to visualize the sequence diversity in 16S rRNA. I finally realized that to see the "V" regions clearly, I need an alignment in which the V regions can align, that is, the sequences need to be closely enough related. To get such a set, I went to RDP (Browsers), and grabbed 187 type sequences from the Order Enterobacteriales. In the next post, I'm actually going to generate the graphic above (it's a tease). The figure shows the information content of the sequence set computed for a window sliding across the length of the gene. The troughs are V regions.
But first, I thought I would post some utility code separately. Here is a script that imports the utility functions shown in the last part, below, and exercises them on the enteric sequences:
Here is what it prints:
The first module contains a function to return the Shannon entropy for a distribution:
info.py
The second module contains some functions that I use all the time. I put this in my
site-packages
directory:utils.py