Python for Bioinformatics: Handling large sequence sets (4)

Just to recap, I'm doing a project involving a paper:
Gawronski et al 2009 PMID 19805314

They reported an STM study of Haemophilus infuenzae in the murine lung. The data we're working with consists of > 500,000 Illumina reads from the library (before in vivo selection), which we previously mapped to the genome sequence and processed to remove duplicated targets.

The results are in unique.txt (8.7 MB), in a format like this:

5859 706010
5871 844100
..

where the first element is the sequence id and the second is the genome position. I wrote a script (consolidate.py) to process this info into a list of genome positions and the count of reads recovered for each site. That's in ordered.txt, in a format like this:

The next step is to get information about the known genes. I should use PyCogent for this, but the example I wrote up for the Cookbook isn't working for this right now. I'll have to look into that issue but I'm in a hurry so.. I go to the Genome project page and the Genes link and download a file called gene_result.txt. It looks like this:

1. accA
acetyl-CoA carboxylase carboxyltransferase subunit alpha[Haemophilus influenzae Rd KW20]
Other Aliases: HI0406
Genomic context: Chromosome
Annotation: NC_000907.1 (427035..427982, complement)
ID: 949505

2. accB
acetyl-CoA carboxylase biotin carboxyl carrier protein subunit[Haemophilus influenzae Rd KW20]
Other Aliases: HI0971
Genomic context: Chromosome
Annotation: NC_000907.1 (1028137..1028604)
ID: 949974
..

I wrote a simple script (extract_genes.py) to parse that info into this format:

accA pep acetyl-CoA_carboxylase_carboxyltransferase_subunit_alpha 427035 427982 -
accB pep acetyl-CoA_carboxylase_biotin_carboxyl_carrier_protein_subunit 1028137 1028604 +
accC pep acetyl-CoA_carboxylase_biotin_carboxylase_subunit 1028779 1030125 +
..

The second element indicates whether the gene encodes RNA or pep (protein). That is in genes.txt. My totals are not quite matching the official numbers, but this is only a demo so we're moving on.

The next step is to use the gene info to assign a gene to each hit, if it's in a gene. The script to do that is assign.py, and the results are in assigned.txt and they look like this:

1010 8 gapdH
1011 4 gapdH
1016 1 gapdH
1023 1
1024 6
..

The next step is to reverse the association. Rather than position => number of hits => gene, I want gene => position => number of hits. The script for that is reverse.py and the results are in genehits.txt. They look like this:

Now (finally) we can do something interesting. In analyze.py, we examine various statistics for each gene. At first, we will look only at genes that suffered hits. The results are sorted by the number of hits and secondarily by gene name. The end of the list looks like this:

oapA        1011   0.41  412   2.49  2513     (322, 90)
iga1        1695   0.38  650   2.34  3963     (517, 133)
HI1480       157   0.23   36  67.82 10647     (34, 2)

The first two elements are the gene name and the size (in amino acids, with some off-by-one errors).

The next four elements are:

the ratio of the number of positions with hits to the size
the number of positions with hits
the ratio of the number of total hits to the size
the number of total hits.

In parentheses at the end is an attempt at quantifying the distribution of hits in each gene. The first position in the tuple indicates the number of hits mapping between 10% to 90% of the gene length, and the second position ar those at the extreme ends of the gene. On the average, the ratio of these two numbers should be about 4:1.

This one is for nostalgia (link):

hemH         324   0.26   84   1.00   323     (61, 23)

We notice that the very last gene on the list seems to be a hot spot for the transposon. At the beginning of this list are genes that are certainly essential. For example:

def          170   0.01    1   0.01     1     (1, 0)
dnaA         455   0.00    1   0.00     1     (1, 0)
engA         505   0.00    1   0.00     1     (1, 0)
fabI         263   0.00    1   0.00     1     (0, 1)
gptA         156   0.01    1   0.01     1     (1, 0)

These are all of a reasonable size, with only one hit. At a guess, these "hits" could be reads that were mis-called against the genome. And then there are ones in the middle like this:

pta          712   0.01    8   0.05    33     (1, 7)
..
adk          215   0.04    9   0.18    38     (0, 9)
..
murD         438   0.03   11   0.09    38     (1, 10)

These have a number of hits but the distribution is skewed to the ends of the gene (probably the distal end). The inside hits are probably mis-calls.

Next time, we'll do some histograms for individual genes. Also, I'd like to reproduce Fig 2 from the paper. That will require processing the data from the library after it's gone through selection.

Finally, there are a number of genes with no hits at all. If they are of a reasonable size, these are also quite likely to be essential. Here are the ones that aren't HI.. types (lots of ribosomal proteins):

acpP          77
ddl          307
folP-1       276
folP-2       276
frr          186
ftsH         636
glmM         446
lpxK         333
mviN         511
pgk          387
pheS         330
ribH         158
rpS11        130
rpS7         157
rplC         209
rplE         143
rplF         178
rplL         124
rplM         143
rplN         124
rplO         145
rplR         118
rplU         104
rplV         111
rplW         100
rplX         104
rpmA          86
rpsC         236
rpsD         207
rpsE         167
rpsF         126
rpsH         131
rpsI         131
rpsJ         104
rpsL         125
rpsM         123
rpsN         102
rpsO          90
rpsP          83
rpsS          92
secE         107
tRNA-Arg-1    77
tRNA-Glu-1    76
tRNA-His-1    76
tRNA-Ile-2    77

Zipped script files on Dropbox (here).

Thursday, March 3, 2011

Handling large sequence sets (4)