Wednesday, March 2, 2011

Handling large sequence sets

I found a really impressive paper that uses what is referred to generically as "deep sequencing" to do a modern version of signature-tagged mutagenesis (wikipedia). The paper is: Gawronski et al 2009 PMID 19805314. The senior author is Brian Akerley at U Mass Medical School, and the work was done with collaborators at the Broad Institute.

The goal is to make a library of transposon insertion mutants of a bacterial pathogen, then use the pool to infect the model organism (here the murine lung). We're interested in this as a "negative" selection, that is, we want to know which individual mutants fail to either grow or survive in the selective environment.

The sequences are from the Illumina platform and the data is on the PNAS server as a supplemental data file (44 MB!). To begin with we'll play with a small number of sequences (25, in 'sub.fna'). The organism is Haemophilus influenzae and the genome sequence is Genbank L42023.

We can get the genome with PyCogent. To make things easier, we'll strip out the title and newlines.

fetch.py:

from cogent.db.ncbi import EFetch
ef = EFetch(id='L42023')
seq = ef.read().split('\n',1)[1].strip()
seq = seq.replace('\n','')
print seq



> python fetch.py > hinf.fna


The genome sequence has a few issues. I use a utility function to load the data (I'm sure you can figure that one out):
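
For reference, here is one plausible version of `load_data` (an assumption; the real `utils.py` isn't shown). Since fetch.py already stripped the title and newlines from the genome file, and the script below still expects the raw `>`-delimited records in 'sub.fna', the simplest consistent guess is that it just returns the stripped file contents:

```python
# a guess at utils.load_data: return the file contents as one string,
# with leading/trailing whitespace removed
def load_data(fn):
    FH = open(fn)
    data = FH.read()
    FH.close()
    return data.strip()
```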


>>> from utils import load_data
>>> seq = load_data('hinf.fna')
>>> len(seq)
1830138
>>> symbols = ''.join(set(seq))
>>> symbols
'ACGKMNSRTWY'


The length matches the GenBank record for L42023. But the sequence is definitely not the best I've ever seen :)


>>> for c in symbols:
...     if c in 'ACGT':
...         continue
...     print c, seq.count(c)
... 
K 14
M 11
N 46
S 12
R 10
W 11
Y 11
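
Those extra letters are IUPAC ambiguity codes (K = G/T, M = A/C, N = any base, and so on). A short sketch (written for Python 3) to tally them in any sequence:

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes: each letter stands for a set of bases
IUPAC = {'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT',
         'K': 'GT', 'M': 'AC', 'N': 'ACGT'}

def ambiguity_counts(seq):
    # count only the non-ACGT symbols present in seq
    counts = Counter(seq)
    return dict((c, counts[c]) for c in IUPAC if counts[c])
```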


The Illumina reads are all about 53 nt long. Our problem is to map them to their genome positions, if possible. The difficulties include:

- sequencing errors
- repeated sequences in the genome

To begin with, here is an attempt to "roll our own" method (bad idea---the name of the script says it all). For each read, we slide a window along the genome sequence and compare to that subsequence, one nucleotide at a time. We bail on the subsequence if the number of mismatches exceeds T = 3.

The script prints the id of the read, the strand of the genome that matched, then the read with the matching genome sequence below it, followed by the time taken for that search. No match was found for sequence #6299, and that search took the longest, about 25 seconds. If the average search takes 10 seconds and we have about 500,000 reads, we can expect to finish in this many hours:


>>> 500000 * 10 / 3600.0
1388.8888888888889


There are better ways.
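
For example, here is a sketch of the standard seed-and-extend idea (not the method the paper uses). By the pigeonhole principle, a read with at most T = 3 mismatches must contain at least one exact k-mer among T + 1 non-overlapping seeds, so we can index the genome's k-mers once and run the Hamming check only at seed hits (Python 3 here):

```python
def build_index(db, k):
    # map every k-mer in the genome to its start positions
    index = {}
    for i in range(len(db) - k + 1):
        index.setdefault(db[i:i + k], []).append(i)
    return index

def seed_search(read, db, index, k, T=3):
    # pigeonhole: with at most T mismatches, at least one of the
    # non-overlapping k-mer seeds must match the genome exactly
    n = len(read)
    for j in range(0, n - k + 1, k):
        for i in index.get(read[j:j + k], []):
            start = i - j
            if start < 0 or start + n > len(db):
                continue
            window = db[start:start + n]
            if sum(a != b for a, b in zip(read, window)) <= T:
                return start
    return None
```

The index costs one pass over the genome; after that, each read needs only a few dictionary lookups plus a handful of Hamming checks, instead of a comparison at every one of ~1.8 million genome positions.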

Output:


> python dumb.py 
seq: 5859 +
TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTATTATAAATTAA
||||||||||||||||||||||||||||||||||||||||| ||||||||||
TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTTTTATAAATTAC
time: 4.86
.....................................................
seq: 5871 +
TATTGATACATGATCTTCCTTATCAAGAAGAGGAGATTGGCATCACAGAGATT
|||||||||||||||||||||||||||||||||||||||||||||||||| ||
TATTGATACATGATCTTCCTTATCAAGAAGAGGAGATTGGCATCACAGAGTTT
time: 5.75
.....................................................
seq: 6135 -
TTCTTTATCCGCTAAATCGCCTAAATCCAAATCTGCTTTTGTAATACTTTGTA
|||||||||||||||||||||||||||||||||||||||||||||||||||||
TTCTTTATCCGCTAAATCGCCTAAATCCAAATCTGCTTTTGTAATACTTTGTA
time: 13.18
.....................................................
seq: 6299 o
time: 24.68
.....................................................
seq: 6407 +
TATTAGATCCCGAGCAAAACACCACATTTAACGATCACTATTTAGAAGGGGAT
|||||||||||||||||||||||||||||||||||||||||||||||| ||||
TATTAGATCCCGAGCAAAACACCACATTTAACGATCACTATTTAGAAGTGGAT
time: 3.32
.....................................................


Code listing:

dumb.py

import string, time
from utils import load_data
db = load_data('hinf.fna')

data = load_data('sub.fna')
data = data.strip().split('>')[1:]

# reverse complement
tt = string.maketrans('ACGT','TGCA')
def rc(seq):
    return seq[::-1].translate(tt)

def hammingFilter(s1,s2,T):
    # True if s1 and s2 differ at no more than T positions
    count = 0
    for c1,c2 in zip(s1,s2):
        if c1 != c2:
            count += 1
            if count > T:
                return False
    return True

def search(seq, db, s):
    # slide a window along db, stop at the first
    # subsequence within T = 3 mismatches of seq
    found = False
    n = len(seq)
    for i in range(len(db)-n+1):
        dseq = db[i:i+n]
        if hammingFilter(seq,dseq,3):
            found = True
            break
    if found:
        print s
        print seq
        pL = list()
        for c1,c2 in zip(seq,dseq):
            if c1 != c2:
                pL.append(' ')
            else:
                pL.append('|')
        print ''.join(pL)
        print dseq
    return found

prev = time.time()
for item in data[:5]:
    title,seq = item.strip().split('\n',1)
    print 'seq: ', title,
    result = search(seq,db,'+')
    if not result:
        result = search(rc(seq),db,'-')
        if not result:
            print 'o'
    current = time.time()
    print 'time:', round(current - prev,2)
    prev = current
    print '.' * 53
