Wednesday, March 2, 2011

Handling large sequence sets (3)

Continuing with the analysis of the Illumina data from Gawronski et al. 2009 (PMID 19805314); the first post is here and the second one is here.

I think the next logical step would be to try to make a fingerprint of each region based on k-mers (like BLAST does), but I'm impatient to see the structure of the data and match the insertions against the ORFs.

[ UPDATE: We look at BLAST here, but there are much better/faster tools available, and that's essential for large datasets. See here. ]
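For the record, here is roughly what I had in mind by a k-mer fingerprint. This is only a sketch: k = 12 is an arbitrary choice and the reverse complement is ignored. The idea is to index every k-mer in the genome by its positions, so that a read's first k-mer gives candidate places to look.


# sketch of a k-mer "fingerprint": map every k-mer in the genome
# to the positions where it occurs (k = 12 is arbitrary; the
# reverse complement is ignored)
def kmer_index(seq, k=12):
    D = dict()
    for i in range(len(seq) - k + 1):
        D.setdefault(seq[i:i+k], []).append(i)
    return D

def candidate_positions(read, D, k=12):
    # wherever the read's first k-mer occurs in the genome
    return D.get(read[:k], [])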

So why not just go straight to BLAST itself? First we need to "format" the "database":


> formatdb -i hinf.fna -p F
> ls -al hinf*
-rw-r--r--@ 1 telliott staff 1830139 Mar 2 12:32 hinf.fna
-rw-r--r-- 1 telliott staff 57 Mar 2 15:42 hinf.fna.nhr
-rw-r--r-- 1 telliott staff 88 Mar 2 15:42 hinf.fna.nin
-rw-r--r-- 1 telliott staff 457972 Mar 2 15:42 hinf.fna.nsq


We use megablast and look at the first 25 reads:


> megablast -d hinf.fna -i first25.fna -e 0.001 -m 8
5859 98.08 52 1 0 1 52 706010 706061 1e-21 95.6
5871 100.00 50 0 0 1 50 844100 844149 8e-23 99.6
6135 100.00 53 0 0 1 53 110702 110650 1e-24 105
6299 100.00 37 0 0 1 37 965968 966004 4e-15 73.8
6407 98.11 53 1 0 1 53 484019 484071 3e-22 97.6
..


Each line of the tabular output (-m 8) gives the query's title, the percent identity, the alignment length, the mismatch and gap counts, the query coordinates, the subject (genome) coordinates of the hit, the E-value, and the bit score. At this E-value there is only a single hit for most sequences; repeats have more than one, as we'll see.
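Since we'll be indexing into these columns below, here is a minimal sketch of unpacking one hit line. The indices follow the columns exactly as they appear in this output, and the variable names are just my own labels:


# unpack one tabular hit line, using the column positions shown above
line = '5859 98.08 52 1 0 1 52 706010 706061 1e-21 95.6'
fields = line.split()
query_id = fields[0]
percent_id = float(fields[1])
align_len = int(fields[2])
mismatches, gaps = int(fields[3]), int(fields[4])
q_start, q_end = int(fields[5]), int(fields[6])
s_start, s_end = int(fields[7]), int(fields[8])
e_value, bit_score = float(fields[9]), float(fields[10])
print('%s  len=%d  subject %d..%d' % (query_id, align_len, s_start, s_end))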

The time doesn't look too bad:


> python -c "import time;  print time.time()"
1299110152.71
> megablast -d hinf.fna -i first10000.fna -e 10 -m 8 > x.txt
> python -c "import time; print time.time()"
1299110165.77
>


At 13 seconds per 10,000 reads, that works out to be about 11 minutes total (it was actually about twice that, see below):


>>> 13*50/60.0
10.833333333333334
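For what it's worth, a less fiddly way to time a run is to do it from inside Python. This is just a sketch, using the same megablast command line as above:


# time the megablast run from inside Python (a sketch)
import subprocess, time

cmd = 'megablast -d hinf.fna -i first10000.fna -e 10 -m 8'
t0 = time.time()
with open('x.txt', 'w') as FH:
    subprocess.call(cmd.split(), stdout=FH)
print('%.1f seconds' % (time.time() - t0))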


Importantly, we do recover multiple hits from repeats, like this one:


24059  98.11 53 1 0 1 53 123710 123658 3e-22 97.6
24059 98.11 53 1 0 1 53 242547 242495 3e-22 97.6
24059 98.11 53 1 0 1 53 629075 629127 3e-22 97.6
24059 98.11 53 1 0 1 53 662112 662164 3e-22 97.6
24059 98.11 53 1 0 1 53 776460 776512 3e-22 97.6
24059 98.11 53 1 0 1 53 1816749 1816697 3e-22 97.6


Take a look using web BLAST at NCBI. There are six hits for this sequence in the KW20 genome. It looks like this is part of the ribosomal RNA operon:


>gb|L42023.1|  Haemophilus influenzae Rd KW20, complete genome
Length=1830138


Features flanking this part of subject sequence:
105 bp at 5' side: rRNA-5S ribosomal RNA
91 bp at 3' side: rRNA-23S ribosomal RNA

Score = 93.5 bits (50), Expect = 5e-20
Identities = 52/53 (99%), Gaps = 0/53 (0%)
Strand=Plus/Minus

Query 1 TAAACAAAGAAAAGTAAATATAGAAGACTTAATAGAAAGAAAATAGGATTCAG 53
|||||||||||||||||||||||||||||||||||||||||||| ||||||||
Sbjct 123710 TAAACAAAGAAAAGTAAATATAGAAGACTTAATAGAAAGAAAATCGGATTCAG 123658


Now we do the local BLAST for real:


> megablast -d hinf.fna -i SD1.txt -e 10 -m 8 > results.txt


We've got warnings. There were 39 of these:


[megablast 2.2.22] WARNING: 439759: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options


Not very many. Ignore. Wait, wait, wait..

CPU Time in Process Viewer: a little over 19 minutes


> ls -al results.txt
-rw-r--r-- 1 telliott staff 33742303 Mar 2 16:32 results.txt


Get the first part of the results with:


>>> from utils import load_data
>>> data = load_data('results.txt')
>>> data = data[:100000]
>>> print data
5859 98.08 52 1 0 1 52 706010 706061 1e-21 95.6
5871 100.00 50 0 0 1 50 844100 844149 8e-23 99.6
6135 100.00 53 0 0 1 53 110702 110650 1e-24 105
6299 100.00 37 0 0 1 37 965968 966004 4e-15 73.8
6407 98.11 53 1 0 1 53 484019 484071 3e-22 97.6


Looks OK. Notice that the fourth sequence, which failed in our earlier attempts, does get a hit here, but the alignment is only 37 bases long, so we'll discard it.
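(load_data comes from the little utils module used in the earlier posts. I haven't shown it here; all it is assumed to do is slurp the whole file and return it as one string, something like this minimal version:)


# utils.py -- a minimal version of load_data, which is assumed
# to read the whole file and return it as a single string
def load_data(fn):
    FH = open(fn, 'r')
    data = FH.read()
    FH.close()
    return data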

Work out a filter script using this part of the data.

filter1.py

from utils import load_data

data = load_data('results.txt')
data = data.strip().split('\n')

for item in data:
    item = item.split()
    # subject (genome) coordinates of the hit
    i, j = item[7:9]
    # keep only hits with an alignment length over 50
    if int(item[2]) > 50:
        print item[0], i, j




> python filter1.py > small.txt



> ls -al small.txt
-rw-r--r-- 1 telliott staff 9636381 Mar 2 19:12 small.txt


Now we're down to about 10 MB. We still need to filter out the repeats.

filter2.py

import sys
from utils import load_data

data = load_data('small.txt')
data = data.strip().split('\n')

dup = None
count = 0
item = data[0]
seq_id = item.split()[0]
prev = seq_id
pL = [item]

# hits for the same query are adjacent in the BLAST output
for item in data[1:]:
    seq_id = item.split()[0]
    # currently skipping dups for this id
    if dup:
        # another hit for the same id, just go on
        if seq_id == dup:
            count += 1
            continue
        # a new id
        else:
            dup = None
            pL.append(item)
            prev = seq_id
    # not currently skipping dups
    else:
        if seq_id == prev:
            # it's a dup, don't keep it
            count += 1
            dup = seq_id
            prev = seq_id
            # pop the first hit for this id as well
            pL.pop()
        else:
            pL.append(item)
            prev = seq_id

# diagnostics: comment these out for the real run (see below)
print len(data)
print count
sys.exit()

print '\n'.join(pL)



> python filter2.py
631756
43702


Comment out the print statements and sys.exit():


> python filter2.py > unique.txt


This part took 2 or 3 minutes. Let's just check:


>>> from utils import load_data
>>> data = load_data('unique.txt').strip().split('\n')
>>> len(data)
571090
>>> 571090 + 43702
614792
>>> 631756 - 614792
16964


Initially, we had 631756 hit lines.

Checking for dups, we found 43702 lines whose query id was already present when we tried to add them. When all the hits for those queries were removed, 571090 lines remained. The difference, 631756 - 43702 - 571090 = 16964, is the number of individual sequences that were present multiple times because of multiple BLAST hits.
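As a cross-check on that accounting, we can count the query ids in small.txt directly; the numbers should reproduce the ones above. This is just a sketch, using the same load_data as before:


# cross-check the duplicate accounting by counting query ids
from utils import load_data

data = load_data('small.txt').strip().split('\n')
counts = {}
for line in data:
    seq_id = line.split()[0]
    counts[seq_id] = counts.get(seq_id, 0) + 1

singles = sum(1 for n in counts.values() if n == 1)     # ids kept by filter2.py
multis = sum(1 for n in counts.values() if n > 1)       # ids with multiple hits
extras = sum(n - 1 for n in counts.values() if n > 1)   # "already present" events
print('%d %d %d %d' % (len(data), singles, multis, extras))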

We have 571090 sequences to process, and that seems like enough. That's for next time.

[UPDATE: Had a lot of trouble with bugs in the code. Sorry for the many posted versions. ]