Python for Bioinformatics: Outliner

Saturday, December 12, 2009

Outliner

As you can tell from recent posts, I'm exploring PyCogent (docs). The most important reason that I'm interested in it is that it has powerful methods for doing phylogenetic analysis. Also, I have a good feeling about the people involved. I hope it has staying power. I never had that feeling about BioPython.

So I've started looking at the Cookbook and other examples. As usual, the docs don't seem too complete. (Nobody wants to write documentation.) Still, I think it could help me get better at phylogenetics to go through this project in more detail. To help me understand, I needed a way to look at the organization of a documentation file at each of its levels.

The levels are keyed by symbols in the text: a row of one symbol on the line following a header, like this:

Sequences
=========
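
A minimal sketch of that rule in Python, under a few assumptions of mine: the marker set is the one my script uses, an underline is a run of one repeated marker character at least two characters long, and the name is_underline is made up for illustration. It is a stricter test than the one in the script, not a full rst parser.

```python
MARKERS = '*=-^"+'  # assumed marker set, same as in the script in this post

def is_underline(line):
    """Return True if line looks like an rst section underline."""
    s = line.rstrip()
    # One marker character repeated, at least two characters long.
    return len(s) >= 2 and s[0] in MARKERS and s == s[0] * len(s)

lines = ["Sequences", "=========", "", "Some text"]

# Pair each line with the one that follows it; a header is a non-empty
# line whose follower is an underline.
headers = [(nxt[0], cur)
           for cur, nxt in zip(lines, lines[1:])
           if cur.strip() and is_underline(nxt)]
print(headers)  # [('=', 'Sequences')]
```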

Python is a great tool for problems like this. I wrote a short script to make outlines from rst files. The output would look cleaner with plain spaces instead of the marker characters at the start of each line, but keeping them helps me see what's going on. Here is some output:

file: Alignments.rst
* Collections and Alignments
=   Sequences
-     Basic DnaSequence objects
^       Constructing a SequenceCollection or Alignment object from strings
^       Loading a collection or alignment from a file
^       Converting a SequenceCollection to FASTA format
^       The elements of a collection or alignment
^       Access individual sequences
^       Keeping a subset of sequences from the alignment
^       Parsing files with many sequences
^       Loading protein sequences in a Phylip file
^       Loading FASTA sequences from an open file or list of lines
^       Loading DNA sequences from a GenBank file
-     Alignments
^       Creating an Alignment object from a SequenceCollection
^       Converting an alignment to FASTA format
^       Converting an alignment into Phylip format
^       Converting an alignment to a list of strings
-     Slicing an alignment
^       By rows (sequences)
^       Getting a single column from an alignment
^       Getting a region of contiguous columns
^       Getting codon 3rd positions from an alignment
^       Filtering positions
^       Filtering sequences
-     Motifs
^       Computing motif probabilities from an alignment
^       Obtaining one column from a slice of an alignment
^       Filtering a single column for a character
^       Calculating gap fractions for each column in an alignment
^       Getting all variable positions from an alignment
^       Getting all variable codons from an alignment
^       Remove all gaps from an alignment in FASTA format
^       Getting the third sequence from an Alignment as a Sequence object
^       Getting 3rd positions from codons
^       Getting 1st and 2nd positions from codons
=   Trees
-     Selecting subtrees
-     Drawing trees

And here is the code:

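import os
import sys

# Header underline characters, ordered from the top level down.
targets = '*=-^"+'

def f(c):
    # Prefix for a header: its marker plus 1, 3, 5, ... spaces by level.
    i = targets.index(c) + 1
    return c + ((2*i-1) * ' ')

try:
    # A file name on the command line selects a single file ...
    fn = sys.argv[1]
    L = [fn]
except IndexError:
    # ... otherwise process every .rst file in the current directory.
    dL = os.listdir(os.getcwd())
    L = [fn for fn in dL if fn.endswith('.rst')]

for fn in L:
    print('file:', fn)
    with open(fn, 'r') as FH:
        data = FH.read().strip().split('\n')

    results = list()
    for i, line in enumerate(data[:-1]):
        nextline = data[i+1]
        # An underline needs at least two characters; checking the length
        # first also avoids an IndexError on one-character lines.
        if len(nextline) < 2:
            continue
        c = nextline[0]
        if c in targets and nextline[1] in targets:
            results.append(f(c) + line)
    print('\n'.join(results))
    print()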
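
For the cleaner, spaces-only look mentioned above, here is a rough sketch. The function name outline_spaces and its width parameter are hypothetical, and the sample text is made up to mirror the top of the output shown earlier. Like the script, it assumes the nesting level is fixed by the order of the marker characters, which matches the files I looked at; rst itself assigns levels by the order in which the markers first appear in a file.

```python
def outline_spaces(lines, markers='*=-^"+', width=2):
    """Return header titles indented by level, without the marker character."""
    out = []
    for title, under in zip(lines, lines[1:]):
        u = under.rstrip()
        # Header: a non-empty line followed by a run of one marker character.
        if title.strip() and len(u) >= 2 and u[0] in markers and u == u[0] * len(u):
            level = markers.index(u[0])  # assumed: marker order gives the level
            out.append(' ' * (width * level) + title.strip())
    return out

# Made-up sample mirroring the first three levels of Alignments.rst
sample = """Collections and Alignments
**************************

Sequences
=========

Basic DnaSequence objects
-------------------------
""".split('\n')

print('\n'.join(outline_spaces(sample)))
```

With this sample, each level is indented two more spaces than the one above it, and the top-level title starts at the left margin.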