Gawronski et al 2009 PMID 19805314
They reported an STM study of Haemophilus infuenzae in the murine lung. The data we're working with consists of > 500,000 Illumina reads from the library (before in vivo selection), which we previously mapped to the genome sequence and processed to remove duplicated targets.
The results are in
unique.txt(8.7 MB), in a format like this:
where the first element is the sequence id and the second is the genome position. I wrote a script (
consolidate.py) to process this info into a list of genome positions and the count of reads recovered for each site. That's in
ordered.txt, in a format like this:
The next step is to get information about the known genes. I should use PyCogent for this, but the example I wrote up for the Cookbook isn't working for this right now. I'll have to look into that issue but I'm in a hurry so.. I go to the Genome project page and the Genes link and download a file called
gene_result.txt. It looks like this:
I wrote a simple script (
extract_genes.py) to parse that info into this format:
The second element indicates whether the gene encodes RNA or pep (protein). That is in
genes.txt. My totals are not quite matching the official numbers, but this is only a demo so we're moving on.
The next step is to use the gene info to assign a gene to each hit, if it's in a gene. The script to do that is
assign.py, and the results are in
assigned.txtand they look like this:
The next step is to reverse the association. Rather than position => number of hits => gene, I want gene => position => number of hits. The script for that is
reverse.pyand the results are in
genehits.txt. They look like this:
Now (finally) we can do something interesting. In
analyze.py, we examine various statistics for each gene. At first, we will look only at genes that suffered hits. The results are sorted by the number of hits and secondarily by gene name. The end of the list looks like this:
The first two elements are the gene name and the size (in amino acids, with some off-by-one errors).
The next four elements are:
the ratio of the number of positions with hits to the size
the number of positions with hits
the ratio of the number of total hits to the size
the number of total hits.
In parentheses at the end is an attempt at quantifying the distribution of hits in each gene. The first position in the tuple indicates the number of hits mapping between 10% to 90% of the gene length, and the second position ar those at the extreme ends of the gene. On the average, the ratio of these two numbers should be about 4:1.
This one is for nostalgia (link):
We notice that the very last gene on the list seems to be a hot spot for the transposon. At the beginning of this list are genes that are certainly essential. For example:
These are all of a reasonable size, with only one hit. At a guess, these "hits" could be reads that were mis-called against the genome. And then there are ones in the middle like this:
These have a number of hits but the distribution is skewed to the ends of the gene (probably the distal end). The inside hits are probably mis-calls.
Next time, we'll do some histograms for individual genes. Also, I'd like to reproduce Fig 2 from the paper. That will require processing the data from the library after it's gone through selection.
Finally, there are a number of genes with no hits at all. If they are of a reasonable size, these are also quite likely to be essential. Here are the ones that aren't HI.. types (lots of ribosomal proteins):
Zipped script files on Dropbox (here).