We're trying to map short sequence reads to a bacterial genome. Ultimately, we'll use
uclust, but I was curious to see how far we could get with Python alone. The first approach used what is basically brute force. The second involves a bit more strategy. Since we'll discard reads with more than 2 mismatches, we will require large regions of identity to the genome. Thus, we can require that most subsequences match exactly.
Let's consider 12-mers. 412 is
which means the chance of an exact match by accident in this genome is about 10%. We'll look at three 12-mers (pos 4-16, 17-29, 30-42). This is pushing it a bit, because the reads start to deteriorate at the end. But I want to do more than just two comparisons (so we don't get misled by the odd error), and I don't want too many random matches. Probably we could use overlapping 12-mers but I didn't try that.
The script uses Python's
findmethod to find exact matches. We do some arithmetic on the match positions and if at least two out of three 12-mers align correctly, we report that as the genome position. There is a bit of code to allow command line args to control how many sequences we do.
Here is the output (verbose) for the first three:
For sequence 5859, we found matches on the '+' strand with the correct separation. Similarly for the second sequence. For the third one, we match only 2/3 on the '-' strand, but that's good enough for me. The first 12-mer matched at 110686, we add 12 + 4 to find the position where the read started.
The end of a run with non-verbose output and 100 reads looks like this:
The second one in that group failed to match. At 5 seconds per 100, to do 500,000 is:
7 hours, which is substantially better than last time. Let's take a look at the first one:
Looks pretty good to me.