The goal is to make a library of transposon insertion mutants of a bacterial pathogen, then use the pool to infect the model organism (here the murine lung). We're interested in this as a "negative" selection, that is, we want to know which individual mutants fail to either grow or survive in the selective environment.
The sequences are from the Illumina platform and the data is on the PNAS server as a supplemental data file (44 MB!). To begin with we'll play with a small number of sequences (25, in 'sub.fna'). The organism is Haemophilus influenzae and the genome sequence is Genbank L42023.
We can get the genome with PyCogent. To make things easier, we'll strip out the title and newlines.
The genome sequence has a few issues. I use a utility function to load the data (I'm sure you can figure that one out):
The length is correct. Compare with this. But the sequence is definitely not the best I've ever seen :)
The Illumina reads are all about 53 nt long. Our problem is to map them to their genome positions, if possible. The difficulties include:
- sequencing errors
- repeated sequences in the genome
To begin with, here is an attempt to "roll our own" method (bad idea---the name of the script says it all). For each read, we slide a window along the genome sequence and compare to that subsequence, one nucleotide at a time. We bail on the subsequence if the number of mismatches exceeds T = 3.
The script prints the id of the read, the strand of the genome that matched, the matching sequence and below it the read, followed by the time taken for that search. No match was found for sequence #6299, and that search took the maximum, about 25 seconds. If the average search takes 10 seconds and we have about 500,000 reads, we can expect to finish in this many hours:
There are better ways.