Wednesday, March 9, 2011

Dental project (2)

This is a series, first post here. Before we do anything else, we need to clean up the titles on the sequences. They come from Genbank like this:

>gi|324104022|gb|HQ894465.1| Uncultured bacterium clone DA19 16S ribosomal RNA gene, partial sequence

We want this:


We could do something like a regular expression, but that's overkill. Note that the alpha part is variable length, so we have to be a little bit smart. But these are so regular, it's easy. Also, UniFrac wants an underscore, so we add that.

Just do this from the command line:

> python > seqs.txt

If you count the sequences, you should have 1124.

import utils as ut
digits = '0123456789'

data = ut.load_data('results.txt')
data = data.strip().split('\n\n')
for item in data:
title,seq = ut.clean_fasta(item)
e = title.split()[4]
d = ''.join([c for c in e if c in digits])
s = ''.join([c for c in e if not c in digits])
print '>' + s + '_' + d
print seq