Python for Bioinformatics: Handling large sequence sets (3)

Continuing with the analysis of the Illumina data from Gawronski et al. 2009 (PMID 19805314); the first post is here and the second one is here.

I think the next logical step would be to try to make a fingerprint of each region based on k-mers (like BLAST does), but I'm impatient to see the structure of the data and match the insertions against the ORFs.

[ UPDATE: We look at BLAST here, but there are much better/faster tools available, and that's essential for large datasets. See here. ]

So why not go on to BLAST itself? First we need to "format" the "database":

> formatdb -i hinf.fna -p F
> ls -al hinf*
-rw-r--r--@ 1 telliott  staff  1830139 Mar  2 12:32 hinf.fna
-rw-r--r--  1 telliott  staff       57 Mar  2 15:42 hinf.fna.nhr
-rw-r--r--  1 telliott  staff       88 Mar  2 15:42 hinf.fna.nin
-rw-r--r--  1 telliott  staff   457972 Mar  2 15:42 hinf.fna.nsq

We use megablast and look at the first 25 reads:

> megablast -d hinf.fna -i first25.fna -e 0.001 -m 8
5859  98.08 52 1 0 1 52 706010 706061 1e-21 95.6
5871  100.00 50 0 0 1 50 844100 844149 8e-23 99.6
6135  100.00 53 0 0 1 53 110702 110650 1e-24  105
6299  100.00 37 0 0 1 37 965968 966004 4e-15 73.8
6407  98.11 53 1 0 1 53 484019 484071 3e-22 97.6
..

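Each line of output starts with the sequence's title, followed by the percent identity, the length of the alignment, the numbers of mismatches and gap openings, the start and end of the match in the query, the coordinates of the hit in the genome, the E-value and the bit score. At this E-value, there is only a single hit for most sequences. Repeats have more than one, as we'll see.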
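
To make the columns explicit, here is a minimal sketch in Python 3 (the listings in this post are Python 2) that splits one output line into named fields. The field names and the `parse_hit` helper are my own labels, not BLAST's, and the layout assumes the 11 whitespace-separated columns exactly as they appear in the lines above.

```python
from collections import namedtuple

# Field names are illustrative labels for the 11 columns as printed above.
Hit = namedtuple(
    "Hit",
    "read_id pct_identity aln_length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)

def parse_hit(line):
    """Split one tabular line into a Hit, converting the numeric fields."""
    f = line.split()
    if len(f) != 11:
        raise ValueError("expected 11 columns, got %d" % len(f))
    return Hit(
        f[0], float(f[1]),
        int(f[2]), int(f[3]), int(f[4]),
        int(f[5]), int(f[6]), int(f[7]), int(f[8]),
        float(f[9]), float(f[10]),
    )

hit = parse_hit("6135  100.00 53 0 0 1 53 110702 110650 1e-24  105")
print(hit.read_id, hit.aln_length, hit.s_start, hit.s_end)
```

Keeping the read title as a string avoids losing anything if a title is not purely numeric.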

The time doesn't look too bad:

> python -c "import time;  print time.time()"
1299110152.71
> megablast -d hinf.fna -i first10000.fna -e 10 -m 8 > x.txt
> python -c "import time;  print time.time()"
1299110165.77
>

At 13 seconds per 10,000 reads, and with about 50 such batches, that works out to about 11 minutes in total (it actually took about twice that, see below):

>>> 13*50/60.0
10.833333333333334
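
The same measurement can be scripted. This is a rough sketch in Python 3, assuming the command is passed as a list of arguments; `time_command` and `extrapolate_minutes` are hypothetical helper names, and the 50 batches are simply the figure used in the arithmetic above.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run cmd (a list of arguments) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.check_call(cmd)
    return time.perf_counter() - start

def extrapolate_minutes(seconds_per_batch, n_batches):
    """Scale the run time of one batch to n_batches, in minutes."""
    return seconds_per_batch * n_batches / 60.0

# The two timestamps printed above bracket the 10,000-read run.
elapsed = 1299110165.77 - 1299110152.71
print(round(elapsed, 2))
print(round(extrapolate_minutes(elapsed, 50), 1))

# Timing a trivial command here, as a stand-in for the megablast call.
trivial = time_command([sys.executable, "-c", "pass"])
print("%.3f s" % trivial)
```

A linear extrapolation like this is only a first guess, since per-read cost need not be constant across the file.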

Importantly, we do recover multiple hits from repeats, like this one:

24059  98.11 53 1 0 1 53 123710 123658 3e-22 97.6
24059  98.11 53 1 0 1 53 242547 242495 3e-22 97.6
24059  98.11 53 1 0 1 53 629075 629127 3e-22 97.6
24059  98.11 53 1 0 1 53 662112 662164 3e-22 97.6
24059  98.11 53 1 0 1 53 776460 776512 3e-22 97.6
24059  98.11 53 1 0 1 53 1816749 1816697 3e-22 97.6
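
Grouping the lines by read makes repeats like this easy to spot. A minimal sketch in Python 3, assuming the column layout shown above (read title first, hit coordinates in the eighth and ninth columns); the variable names are illustrative. A hit whose start coordinate is larger than its end lies on the minus strand.

```python
from collections import defaultdict

lines = """\
24059  98.11 53 1 0 1 53 123710 123658 3e-22 97.6
24059  98.11 53 1 0 1 53 242547 242495 3e-22 97.6
24059  98.11 53 1 0 1 53 629075 629127 3e-22 97.6
24059  98.11 53 1 0 1 53 662112 662164 3e-22 97.6
24059  98.11 53 1 0 1 53 776460 776512 3e-22 97.6
24059  98.11 53 1 0 1 53 1816749 1816697 3e-22 97.6
6299  100.00 37 0 0 1 37 965968 966004 4e-15 73.8
""".splitlines()

hits_by_read = defaultdict(list)
for line in lines:
    fields = line.split()
    start, end = int(fields[7]), int(fields[8])
    # start > end means the read aligned to the minus strand
    strand = "+" if start <= end else "-"
    hits_by_read[fields[0]].append((start, end, strand))

# One summary line per read: title, number of hits, strand of each hit.
for read_id, hits in hits_by_read.items():
    print(read_id, len(hits), "".join(s for _, _, s in hits))
```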

Take a look using web BLAST at NCBI. There are five hits for this sequence in the KW20 genome. It looks like this is a part of the ribosomal RNA operon:

>gb|L42023.1|  Haemophilus influenzae Rd KW20, complete genome
Length=1830138


Features flanking this part of subject sequence:
  105 bp at 5' side: rRNA-5S ribosomal RNA
  91 bp at 3' side: rRNA-23S ribosomal RNA

Score = 93.5 bits (50),  Expect = 5e-20
Identities = 52/53 (99%), Gaps = 0/53 (0%)
Strand=Plus/Minus

Query  1       TAAACAAAGAAAAGTAAATATAGAAGACTTAATAGAAAGAAAATAGGATTCAG  53
              |||||||||||||||||||||||||||||||||||||||||||| ||||||||
Sbjct  123710  TAAACAAAGAAAAGTAAATATAGAAGACTTAATAGAAAGAAAATCGGATTCAG  123658

Now we do the local BLAST for real:

> megablast -d hinf.fna -i SD1.txt -e 10 -m 8 > results.txt

We've got warnings. There were 39 of these:

[megablast 2.2.22] WARNING: 439759: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

Not very many. Ignore. Wait, wait, wait..

CPU Time in Process Viewer: a little over 19 minutes

> ls -al results.txt
-rw-r--r--  1 telliott  staff  33742303 Mar  2 16:32 results.txt

Get the first part with:

>>> from utils import load_data
>>> data = load_data('results.txt')
>>> data = data[:100000]
>>> print data
5859  98.08 52 1 0 1 52 706010 706061 1e-21 95.6
5871  100.00 50 0 0 1 50 844100 844149 8e-23 99.6
6135  100.00 53 0 0 1 53 110702 110650 1e-24  105
6299  100.00 37 0 0 1 37 965968 966004 4e-15 73.8
6407  98.11 53 1 0 1 53 484019 484071 3e-22 97.6

Looks OK. Notice that the fourth sequence, which failed in our earlier attempts, does match over 37 bases, but we'll discard it as too short.

Work out a filter script using this part of the data.

filter1.py

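from utils import load_data

# load the BLAST output as one string and split it into lines
data = load_data('results3.txt')
data = data.strip().split('\n')

for item in data:
    item = item.split()
    # fields 7 and 8 are the start and end of the hit in the genome
    i,j = item[7:9]
    # field 2 is the alignment length; keep only hits longer than 50
    if int(item[2]) > 50:
        print item[0],i,j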

> python filter1.py > med.txt

> ls -al small.txt
-rw-r--r--  1 telliott  staff  9636381 Mar  2 19:12 small.txt

Now about 10 MB. We still need to filter out repeats:

filter2.py

import sys
from utils import load_data
data = load_data('med1.txt')
data = data.strip().split('\n')
 
dup = None
item = data[0]
seq_id = item.split()[0]
prev = seq_id
pL = [item]

for item in data[1:]:
    seq_id = item.split()[0]
    # new item, doing dups currently
    if dup:
        # another one, just go
        if seq_id == dup:
            continue
        # a new item
        else:
            dup = None
            pL.append(item)
            prev = seq_id
    # not doing dups currently
    else:
        if seq_id == prev:
            # it's a dup, don't print
            dup = seq_id
            # not used..
            prev = seq_id
            # pop the last item
            pL.pop()
        else:
            pL.append(item)
            prev = seq_id
            
print '\n'.join(pL)
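
A way to do this that does not depend on the order of the lines is to count the IDs first and then keep only the lines whose ID occurs once. A minimal sketch in Python 3; `drop_multi_hit_reads` is a hypothetical name, and unlike filter2.py it does not assume that hits for the same read are adjacent.

```python
from collections import Counter

def drop_multi_hit_reads(lines):
    """Return only the lines whose read ID (first field) occurs exactly once."""
    counts = Counter(line.split()[0] for line in lines)
    return [line for line in lines if counts[line.split()[0]] == 1]

lines = [
    "5859 706010 706061",
    "24059 123710 123658",
    "24059 242547 242495",
    "24059 629075 629127",
    "6135 110702 110650",
]
unique = drop_multi_hit_reads(lines)
print("\n".join(unique))
```

The trade-off is a second pass over the data and a dictionary holding every read ID, which is modest for a file of this size.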

> python filter2.py
631756
43702

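Those two counts came from a debugging version with extra print statements and a sys.exit() call. With those commented out (as in the listing above), save the result: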

> python filter2.py > unique.txt

This part took 2 or 3 minutes. Let's just check:

>>> from utils import load_data
>>> data = load_data('unique.txt').strip().split('\n')
>>> len(data)
571090
>>> 571090 + 43702
614792
>>> 631756 - 614792
16964

Initially, we had 631756 hits.

Checking for dups, we found 43702 hits whose sequence was already present when we tried to add them. When all the sequences in this class were removed, 571090 remained. The difference, 631756 - 43702 - 571090 = 16964, is the number of individual sequences that were present multiple times due to multiple hits during BLAST.
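
The bookkeeping can be checked on a toy example. A sketch in Python 3, with `tally` as a hypothetical helper: every line is either the only hit for its read, the first hit of a multi-hit read, or an extra hit for a read already seen, so the three counts must add up to the total.

```python
from collections import Counter

def tally(read_ids):
    """Return (total_lines, single_hit_reads, multi_hit_reads, extra_lines)."""
    counts = Counter(read_ids)
    total = sum(counts.values())
    singles = sum(1 for n in counts.values() if n == 1)
    multi = sum(1 for n in counts.values() if n > 1)
    extra = sum(n - 1 for n in counts.values() if n > 1)
    return total, singles, multi, extra

# Toy input: two single-hit reads and one read with three hits.
total, singles, multi, extra = tally(["a", "b", "c", "c", "c"])
print(total, singles, multi, extra)
assert total == singles + multi + extra

# The numbers above obey the same identity.
print(631756 - 571090 - 43702)
```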

We have 571090 sequences to process, which seems like enough. That's for next time.

[UPDATE: Had a lot of trouble with bugs in the code. Sorry for the many posted versions. ]

Wednesday, March 2, 2011
