Python for Bioinformatics: Pretty code (3)

Monday, August 22, 2011


I worked a bit more on the Python parser that I described last time (here). Actually, I started over from scratch. The first approach broke the code into words for processing, but I realized that a "stream" approach, reading one character at a time, works better for comments and strings, since the parser only has to remember whether it is currently inside one. Keywords are still picked out word by word in a second pass. The only limitation I'm aware of is that I still don't handle triple-quoted strings, but I think that's doable in the future.
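As a rough sketch of why the stream approach should extend to triple quotes, the fragment below walks the text one character at a time and keeps a single piece of state: the delimiter of the string it is currently inside, if any. Looking ahead two characters is enough to tell a triple quote from a single one. The name `mark_strings` and the literal `<str>` tags are assumptions for illustration, not part of the parser in this post, and backslash-escaped quotes and comments are ignored.

```python
def mark_strings(src, start='<str>', stop='</str>'):
    # Hypothetical helper: wrap string literals in tags, scanning the
    # text character by character. Escaped quotes are not handled.
    out = []
    delim = None   # delimiter of the string we are inside, or None
    i = 0
    while i < len(src):
        c = src[i]
        if delim is None and c in ('"', "'"):
            # look ahead: three identical quotes open a triple-quoted string
            delim = c * 3 if src[i:i + 3] == c * 3 else c
            out.append(start + delim)
            i += len(delim)
        elif delim is not None and src.startswith(delim, i):
            # the matching delimiter closes the current string
            out.append(delim + stop)
            i += len(delim)
            delim = None
        else:
            # ordinary character, or a quote that does not match delim
            out.append(c)
            i += 1
    return ''.join(out)


print(mark_strings("s = '''a 'b' c'''"))
```

The single `delim` variable stands in for the separate single-quote and double-quote flags, so adding the two triple-quote cases costs no extra state.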

As I mentioned before, I'm using <br /> tags in the code now rather than newlines.

The instructions for adding custom styles said to modify the blog template, but I'm afraid of breaking the formatting of old posts. Instead, I put the following <style> block just below the <body> tag, as shown. It seems to work.

<body>

<style>
  cd { font-size:120%; }
  cm { color: green }
  kw { color: blue; }
  str { color: red; }
</style>

Then, to be totally meta about it, I ran the parser on itself. It is set up to read the HTML tags from a separate file (otherwise the meta application chokes on the tags). The tags are the ones styled above. There is also a head and a tail to make a stand-alone .html document, but you don't need those for pasting into the blog. The script actually writes the output file itself (the second argument, or example.html by default), so the redirect isn't strictly needed, but I invoked it like this:

python simple_parser2.py simple_parser2.py > example.html
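One likely reason literal tags in the source cause trouble is that a browser treats them as markup. A possible alternative to keeping the tags in a separate file, which is not what the script does, is to escape the source text before any highlighting tags are added. The sketch below uses the standard library's `html.escape` (Python 3); the function name `escape_source` is an assumption.

```python
import html


def escape_source(text):
    # Hypothetical pre-processing step: turn &, < and > into entities.
    # quote=False leaves ' and " untouched, so a later pass can still
    # find the string delimiters.
    return html.escape(text, quote=False)


print(escape_source('if a < b: s = "<kw>"'))
```

Escaping would have to happen before the highlighting passes, so that the tags the parser adds are not escaped along with the source.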

And here is the parser itself:


import sys
from keyword import iskeyword
from utils import load_data
import html_tags as H

try:  
    fn = sys.argv[1]
except IndexError:  
    fn = 'example.py'
data = list(load_data(fn))

D = {'is_cm':False,
     'is_str_1':False,'is_str_2':False }

L = list()
for c in data:
    # comments first: a '#' opens a comment only when we are not
    # already inside a comment or a string
    if c == '#':
        if not (D['is_cm'] or D['is_str_1'] or D['is_str_2']):
            L.extend(list(H.cm_start))
            D['is_cm'] = True
    if c == "\n" and D['is_cm']:
        L.extend(list(H.cm_stop))
        D['is_cm'] = False
    L.append(c)
    
    # single-quoted strings (quotes inside a comment are ignored)
    if c == "'" and not D['is_cm']:
        if not D['is_str_1']:
            if not D['is_str_2']:
                # start a str_1
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['is_str_1'] = True
            else:
                # already in str_2
                pass
        else:
            # terminate str_1    
            L.extend(list(H.str_stop))
            D['is_str_1'] = False
    # double-quoted strings (quotes inside a comment are ignored)
    if c == '"' and not D['is_cm']:
        if not D['is_str_2']:
            if not D['is_str_1']:
                # start a str_2
                L.pop()
                L.extend(list(H.str_start))
                L.append(c)
                D['is_str_2'] = True
            else:
                # already in str_1
                pass
        else:
            # terminate str_2    
            L.extend(list(H.str_stop))
            D['is_str_2'] = False
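# close a comment left open if the input does not end with a newline
if D['is_cm']:
    L.extend(list(H.cm_stop))
s = ''.join(L)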

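# keywords last
import re  # needed only for this pass

# a string span as marked up by the first pass
str_span = re.compile('(' + re.escape(H.str_start) + '.*?'
                      + re.escape(H.str_stop) + ')')

def mark_kw(m):
    # wrap a whole word in keyword tags only if it is a Python keyword
    w = m.group(0)
    if iskeyword(w):
        return H.kw_start + w + H.kw_stop
    return w

pL = list()
for line in s.split('\n'):
    # no kw highlighting in comments: set the comment part aside
    code, sep, comment = line.partition(H.cm_start)
    parts = str_span.split(code)
    # even-numbered parts are code, odd-numbered parts are strings
    for i in range(0, len(parts), 2):
        # match whole words, so 'in' inside 'line' is left alone
        parts[i] = re.sub(r'\b[A-Za-z_]\w*\b', mark_kw, parts[i])
    pL.append(''.join(parts) + sep + comment)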
s = H.br.join(pL)

pL = [H.head, H.hr, s, H.hr, H.tail]
s = '\n'.join(pL)

try:
    fn = sys.argv[2]
except IndexError:
    fn = 'example.html'
# the with block closes the file even if the write fails
with open(fn, 'w') as FH:
    FH.write(s + '\n')
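
The listing imports two helper modules that are not shown in this post. As a guess at what they need to provide, the sketch below defines a `load_data` that returns a file's text, plus the tag strings the parser looks up on `H`. The attribute names come from the listing; every value is an assumption, chosen to match the style block near the top of the post. In practice these would live in utils.py and html_tags.py rather than in one file.

```python
# Stand-ins for utils.py and html_tags.py (all contents are assumptions).
from types import SimpleNamespace


def load_data(fn):
    # return the whole file as one string; the parser turns it into
    # a list of characters with list(...)
    with open(fn) as fh:
        return fh.read()


# in the real setup these would be module-level names in html_tags.py
H = SimpleNamespace(
    cm_start='<cm>', cm_stop='</cm>',      # comments
    str_start='<str>', str_stop='</str>',  # string literals
    kw_start='<kw>', kw_stop='</kw>',      # keywords
    br='<br />',                           # joins the lines in place of '\n'
    hr='<hr />',                           # rule above and below the code
    head='<html>\n<body>\n<cd>',           # opens a stand-alone document
    tail='</cd>\n</body>\n</html>',        # closes it
)

print(H.kw_start + 'for' + H.kw_stop)
```

Keeping the tag strings out of the parser's own source is what lets the parser be run on itself without tripping over them.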