Python for Bioinformatics: Exploring mitochondrial genomes

Thursday, December 16, 2010

Exploring mitochondrial genomes

Following the ape book (Phylogenetics in R; starting on p. 55), they use a data set in a file named mammal_mtGenome.fasta. It's not part of the datasets with the ape package (too big), but a Google search led me to versions of Ch. 3 on the web, so you can follow along even if you don't have the book.

The example uses a database called OGRe---Organellar Genome Retrieval system (here). It has 1244 organisms! What's nice is that they've organized the data by genes and they have the nucleotide sequences.

If we go to select species from taxonomy and expand the phylogeny: Chordata > Vertebrata > Mammalia, we can get a checkbox for all mammals, then at the bottom of the page do display sequences and from the table at the very bottom choose these genes: rRNAs plus protein encoding genes: ATP6 .. ND6. I saved the download in 'sequences.fasta' and it's 2.7 MB.

The data file contains a header, followed by a section with entries like:

#HOMSAPMIT : _Homo sapiens_ (human) : MITOCHONDRION

after that we find the sequences of the genes, grouped by gene (all the ATP6 genes first, then ATP8 all the way to ND6 and beyond). Those entries look like this:

>GALPYRMIT(ATP6)
atgaac..

>ORYCUNMIT(12SrRNA)Sequence does not exist

A double newline ('\n\n') is found between successive "does not exist" entries, but '\n\n\n' between other entries. There's a total of 233 species. Parsing the data file is a fairly trivial Python exercise. We use the fact that the organism code name is the first element, that the genus contains '_' as its first letter while the species contains '_' as its last letter, and that the common name is enclosed in parentheses. We sort and print the results. Next time, we'll do something with the data. Here we just print the first few lines:

Acinonyx jubatus               ACIJUBMIT    cheetah
Ailuropoda melanoleuca         AILMELMIT    giant panda
Ailurus styani                 FULSTYMIT    red panda
Ammotragus lervia              AMMLERMIT    aoudad
Anomalurus GP-2005             ANOSP1MIT    scaly-tailed squirrel
Arctocephalus forsteri         ARCFORMIT    New Zealand fur seal
Arctocephalus pusillus         ARCPUSMIT    cape fur seal
Arctocephalus townsendi        ARCTOWMIT    Guadalupe fur seal
Artibeus jamaicensis           ARTJAMMIT    Jamaican fruit-eating bat

Code listing:

import matplotlib.pyplot as plt

fn = 'sequences.fasta'
FH = open(fn,'r')
data = FH.read().strip()
FH.close()

results = list()
header, defs, rest = data.split('\n\n',2)
for line in defs.strip().split('\n'):
    L = line.split()
    code = L[0][1:]
    genus = [w for w in L if w[0] == '_'][0][1:]
    species = [w for w in L if w[-1] == '_'][0][:-1]
    i = line.find('(')
    j = line.find(')')
    name = line[i+1:j]
    results.append((genus,species,code,name))

def test():
    print len(results)
    results.sort()
    for g,s,c,n in results:
        print (g + ' ' + s).ljust(30), c, '  ', n

test()