Sunday, November 28, 2010

Heat map meets RNA secondary structure



I have seen somewhere a combination heat map and secondary structure plot of the bacterial 16S rRNA gene. Something like the graphic above but with colors that show the extent of conservation at each position. I think it's probably what is referenced in PMID 8811093, except that the server listed in the reference seems to be offline, and what's in the paper doesn't look as great as what I remember.

So that's our project for today. The first thing I did was go to a page at NCBI that list all the completed bacterial genomes and gives a link to pages with their RNA or protein products. I'm not actually sure where this page lives any more, I got the link from an old post. :)

I got the sequence of rrnC from E. coli MG1655 and then I shortened the title and cut out everything beyond nt 560:

>Ecoli
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGT
AACAGGAAACAGCTTGCTGTTTCGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGAT
GGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTC
GGGCCTCTTGCCATCAGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGAC
GATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAG
GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT
TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCC
GCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAA

I used the Vienna RNA server to fold it, and recovered a heat map version as an eps file. You can check the predicted structure against the scan above, it looks pretty good. The eps file lists the colors in terms of a heat map. But the values are for base-pair probabilities, not what I want. Luckily, the colors are encoded as values between 0 and 1, one on each line from 876 to 1435 (= 560).

I grabbed 100 of my personal database sequences at random, added the E. coli sequence and used MUSCLE to make an alignment. Then I wrote two simple scripts. One gets the counts for each position that is present in the E. coli sequence and writes them to disk. The second one assigns colors to the values based on whatever filter I put in it, the one here is:

steps =  [ 0.97, 0.9,  0.8, 0.7,  0.6,  0.5, 0.4 ]
colors = [ 1.0, 0.95, 0.8, 0.6, 0.45, 0.3, 0.15 ]

For each set of counts, we get the maximum, then at each step if we're over the step we return the color.

That script also modifes the eps file. Afterwards, I just double-click to open the eps file in Preview. It looks great! I added the color bar from the original version by hand. Zipped folder with project files is here. Since the eps file contains the layout position for each nucleotide, we could get more sophisticated and do the whole plot ourselves.

Note: no error checking at all yet. Hope there's no whoppers...
[ UPDATE: Notice the co-variation among paired bases, they tend to have the same color. Awesome! ]