Python for Bioinformatics: Dental project (2)

Wednesday, March 9, 2011

Dental project (2)

This is a series, first post here. Before we do anything else, we need to clean up the titles on the sequences. They come from Genbank like this:

>gi|324104022|gb|HQ894465.1| Uncultured bacterium clone DA19 16S ribosomal RNA gene, partial sequence

We want this:

>DA_19

.

We could do something like a regular expression, but that's overkill. Note that the alpha part is variable length, so we have to be a little bit smart. But these are so regular, it's easy. Also, UniFrac wants an underscore, so we add that.

Just do this from the command line:

> python retitle.py > seqs.txt

If you count the sequences, you should have 1124.

retitle.py

import utils as ut
digits = '0123456789'

data = ut.load_data('results.txt')
data = data.strip().split('\n\n')
for item in data:
    title,seq = ut.clean_fasta(item)
    e = title.split()[4]
    d = ''.join([c for c in e if c in digits])
    s = ''.join([c for c in e if not c in digits])
    print '>' + s + '_' + d
    print seq
    print