## Wednesday, March 2, 2011

### Handling large sequence sets (2)

Continuing with analysis of the Illumina data fromGawronski et al 2009 PMID 19805314, first post here.

We're trying to map short sequence reads to a bacterial genome. Ultimately, we'll use `BLAST` and `uclust`, but I was curious to see how far we could get with Python alone. The first approach used what is basically brute force. The second involves a bit more strategy. Since we'll discard reads with more than 2 mismatches, we will require large regions of identity to the genome. Thus, we can require that most subsequences match exactly.

Let's consider 12-mers. 412 is

 `>>> 4**1216777216`

which means the chance of an exact match by accident in this genome is about 10%. We'll look at three 12-mers (pos 4-16, 17-29, 30-42). This is pushing it a bit, because the reads start to deteriorate at the end. But I want to do more than just two comparisons (so we don't get misled by the odd error), and I don't want too many random matches. Probably we could use overlapping 12-mers but I didn't try that.

The script uses Python's `find` method to find exact matches. We do some arithmetic on the match positions and if at least two out of three 12-mers align correctly, we report that as the genome position. There is a bit of code to allow command line args to control how many sequences we do.

Here is the output (verbose) for the first three:

 `> python better.py 4 (706013, '+', 1)16 (706025, '+', 1)28 (706037, '+', 1)>5859 706009 +.................... 4 (844103, '+', 1)16 (844115, '+', 1)28 (844127, '+', 1)>5871 844099 +.................... 4 (110686, '-', 0)16 (110674, '-', 0)28 (1639804, '+', 1)>6135 110702 -....................`

For sequence 5859, we found matches on the '+' strand with the correct separation. Similarly for the second sequence. For the third one, we match only 2/3 on the '-' strand, but that's good enough for me. The first 12-mer matched at 110686, we add 12 + 4 to find the position where the read started.

The end of a run with non-verbose output and 100 reads looks like this:

 `>25631 1059930 +>25664 -1 o>25727 597085 +time: 5.05`

The second one in that group failed to match. At 5 seconds per 100, to do 500,000 is:

 `>>> 5000*5/3600.06.9444444444444446`

7 hours, which is substantially better than last time. Let's take a look at the first one:

 `>>> from utils import load_data>>> db = load_data('hinf.fna')>>> d = db[706009:706009+53]>>> d'TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTTTTATAAATTAC'>>> seqs = load_data('first10000.fna').split('>')[1]>>> s = seqs.split('\n')[1]>>> L = list()>>> for c1,c2 in zip(d,s):... if c1 == c2: L.append('|')... else: L.append(' ')... >>> print '\n'.join((d,''.join(L),s))TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTTTTATAAATTAC||||||||||||||||||||||||||||||||||||||||| |||||||||| TAGTAATCCTTGTTTATGAGATAATCTCTCGCATCTTTTGTATTATAAATTAA`

Looks pretty good to me.

`decent.py`

 `import sys, timefrom string import maketransfrom utils import load_datatt = maketrans('ACGT','TGCA')def rc(seq): return seq[::-1].translate(tt)db = load_data('hinf.fna')seqL = load_data('first10000.fna').strip().split('>')[1:]def try_one(s): i = db.find(s) if i != -1: return i, '+', db.count(s) rs = rc(s) i = db.find(rs) if i != -1: return i, '-', db.count(s) def count_matches(results,N,k): for i,r in enumerate(results): for j in range(i): s = results[j] if not r or not s: continue diff = abs(r[0]-s[0]) strand = r[1] if diff in [N, N*2]: if strand != s[1]: # opposite strand continue match = r[0] if strand == '+': return match - (i*N + k), '+' else: return match + ((i+1)*N + k), '-' return -1, 'o' N = 12k = 4n = 10v = Trueif len(sys.argv) > 1: arg = sys.argv[1] if arg == 'all': print n n = len(seqL) v = False else: n = int(arg) v = n < 11start = time.time()for item in seqL[:n]: title,seq = item.strip().split('\n') rL = list() for i in [k,k+N,k+2*N]: result = try_one(seq[i:i+N]) if result: if v: print str(i).rjust(2), result rL.append(result) else: rL.append(None) print '>' + title, t = count_matches(rL,N,k) print t[0], t[1] if v: print '.'*20print 'time: ', round(time.time() - start, 2)`