I think the next logical step would be to try to make a fingerprint of each region based on k-mers (like BLAST does), but I'm impatient to see the structure of the data and match the insertions against the ORFs.
[ UPDATE: We look at BLAST here, but there are much better/faster tools available, and that's essential for large datasets. See here. ]
So why not go onto BLAST itself? First we need to "format" the "database":
We use
megablast
and look at the first 25 reads:The output includes the sequence's title line, the length of the alignment, and the coordinates of the hit. At this E-value, there is only a single hit for most sequences. Repeats have more than one, as we'll see.
The time doesn't look too bad:
At 13 seconds per 10,000 reads, that works out to be about 11 minutes total (it was actually about twice that, see below):
Importantly, we do recover multiple hits from repeats, like this one:
Take a look using web BLAST at NCBI. There are five hits for this sequence in the KW20 genome. It looks like this is a part of the ribosomal RNA operon:
Now we do the local BLAST for real:
We've got warnings. There were 39 of these:
Not very many. Ignore. Wait, wait, wait..
CPU Time in Process Viewer: a little over 19 minutes
Get the first part with:
Looks OK. Notice that the fourth sequence, which failed in our earlier attempts, does match to 37, but we'll discard it.
Work out a filter script using this part of the data.
filter1.py
Now about 10 MB. We still need to filter out repeats
filter2.py
Comment out the print statements and sys.exit():
This part took a 2 or 3 minutes. Let's just check:
Initially, we had 631756.
Checking for dups, we found 43702 sequences that, when we tried to add, them, we found they were already present. When all the sequences in this class were removed, we had 571090. The difference, 631756 - 43702 - 571090 = 16964, is the number of individual sequences that were present multiple times due to multiple hits during BLAST.
We have 571090 sequences to process---it seems like enough. That's for next time.
[UPDATE: Had a lot of trouble with bugs in the code. Sorry for the many posted versions. ]