Python for Bioinformatics: Handling large sequence sets (2)

Wednesday, March 2, 2011

Handling large sequence sets (2)

Continuing with analysis of the Illumina data fromGawronski et al 2009 PMID 19805314, first post here.

We're trying to map short sequence reads to a bacterial genome. Ultimately, we'll use BLAST and uclust, but I was curious to see how far we could get with Python alone. The first approach used what is basically brute force. The second involves a bit more strategy. Since we'll discard reads with more than 2 mismatches, we will require large regions of identity to the genome. Thus, we can require that most subsequences match exactly.

Let's consider 12-mers. 4¹² is

>>> 4**12
16777216

which means the chance of an exact match by accident in this genome is about 10%. We'll look at three 12-mers (pos 4-16, 17-29, 30-42). This is pushing it a bit, because the reads start to deteriorate at the end. But I want to do more than just two comparisons (so we don't get misled by the odd error), and I don't want too many random matches. Probably we could use overlapping 12-mers but I didn't try that.

The script uses Python's find method to find exact matches. We do some arithmetic on the match positions and if at least two out of three 12-mers align correctly, we report that as the genome position. There is a bit of code to allow command line args to control how many sequences we do.

Here is the output (verbose) for the first three:

> python better.py 
 4 (706013, '+', 1)
16 (706025, '+', 1)
28 (706037, '+', 1)
>5859 706009 +
....................
 4 (844103, '+', 1)
16 (844115, '+', 1)
28 (844127, '+', 1)
>5871 844099 +
....................
 4 (110686, '-', 0)
16 (110674, '-', 0)
28 (1639804, '+', 1)
>6135 110702 -
....................

For sequence 5859, we found matches on the '+' strand with the correct separation. Similarly for the second sequence. For the third one, we match only 2/3 on the '-' strand, but that's good enough for me. The first 12-mer matched at 110686, we add 12 + 4 to find the position where the read started.

The end of a run with non-verbose output and 100 reads looks like this:

>25631 1059930 +
>25664 -1 o
>25727 597085 +
time:  5.05

The second one in that group failed to match. At 5 seconds per 100, to do 500,000 is:

>>> 5000*5/3600.0
6.9444444444444446

7 hours, which is substantially better than last time. Let's take a look at the first one:

>>> from utils import load_data
>>> db = load_data('hinf.fna')
>>> d = db[706009:706009+53]
>>> d
'TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTTTTATAAATTAC'
>>> seqs = load_data('first10000.fna').split('>')[1]
>>> s = seqs.split('\n')[1]
>>> L = list()
>>> for c1,c2 in zip(d,s):
...     if c1 == c2:  L.append('|')
...     else:  L.append(' ')
... 
>>> print '\n'.join((d,''.join(L),s))
TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTTTTATAAATTAC
||||||||||||||||||||||||||||||||||||||||| |||||||||| 
TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTATTATAAATTAA

Looks pretty good to me.

decent.py

import sys, time
from string import maketrans
from utils import load_data

tt = maketrans('ACGT','TGCA')
def rc(seq):
    return seq[::-1].translate(tt)

db = load_data('hinf.fna')
seqL = load_data('first10000.fna').strip().split('>')[1:]

def try_one(s):
    i = db.find(s)
    if i != -1:
        return i, '+', db.count(s)
    rs = rc(s)
    i = db.find(rs)
    if i != -1:
        return i, '-', db.count(s)
        
def count_matches(results,N,k):
    for i,r in enumerate(results):
        for j in range(i):
            s = results[j]
            if not r or not s:
                continue
            diff = abs(r[0]-s[0])
            strand = r[1]
            if diff in [N, N*2]:
                if strand != s[1]:   # opposite strand
                    continue
            match = r[0]
            if strand == '+':
                return match - (i*N + k), '+'
            else:
                return match + ((i+1)*N + k), '-'
    return -1, 'o'
            
N = 12
k = 4
n = 10
v = True
if len(sys.argv) > 1:
    arg = sys.argv[1]
    if arg == 'all':
        print n
        n = len(seqL)
        v = False
    else:
        n = int(arg)
        v = n < 11

start = time.time()
for item in seqL[:n]:
    title,seq = item.strip().split('\n')
    rL = list()
    for i in [k,k+N,k+2*N]:
        result = try_one(seq[i:i+N])
        if result:
            if v:  print str(i).rjust(2), result
            rL.append(result)
        else:
            rL.append(None)
    print '>' + title, 
    t = count_matches(rL,N,k)
    print t[0], t[1]
    if v:  print '.'*20
print 'time: ', round(time.time() - start, 2)