mammal_mtGenome.fasta
. It's not part of the datasets shipped with the ape package (too big), but a Google search led me to versions of Ch. 3 on the web, so you can follow along even if you don't have the book. The example uses a database called OGRe, the Organellar Genome Retrieval system (here). It has 1244 organisms! What's nice is that the data is organized by gene and includes the nucleotide sequences.
If we go to 'select species from taxonomy' and expand the phylogeny (Chordata > Vertebrata > Mammalia), we get a checkbox for all mammals. Then, at the bottom of the page, choose 'display sequences', and from the table at the very bottom pick these genes: the rRNAs plus the protein-coding genes ATP6 .. ND6. I saved the download in 'sequences.fasta' and it's 2.7 MB.
The data file contains a header, followed by a section with entries like:
After that we find the sequences of the genes, grouped by gene (all the ATP6 genes first, then ATP8, and so on through ND6 and beyond). Those entries look like this:
or
A double newline ('\n\n') separates successive "does not exist" entries, while a triple newline ('\n\n\n') separates all other entries. There's a total of 233 species.
Parsing the data file is a fairly trivial Python exercise. We use the fact that the organism code name is the first element, that the genus has '_' as its first character while the species has '_' as its last character, and that the common name is enclosed in parentheses. We sort and print the results. Next time, we'll do something with the data. Here we just print the first few lines:
Code listing:
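A minimal sketch of the parsing rules described above. The exact layout of an entry line is an assumption here: I take it to look something like `HSAMIT _Homo sapiens_ (human)`, and the function names `split_entries` and `parse_entry` are made up for illustration. Adjust to whatever the downloaded file actually contains.

```python
import re


def split_entries(text):
    """Split raw text into entries on runs of two or more newlines.

    This treats both the '\n\n' and the '\n\n\n' separators the same way.
    """
    return [chunk.strip() for chunk in re.split(r'\n{2,}', text) if chunk.strip()]


def parse_entry(line):
    """Parse one species line into (code, genus, species, common_name).

    Assumed (hypothetical) layout: CODE _Genus species_ (common name)
    """
    tokens = line.split()
    code = tokens[0]  # organism code name is the first element
    # genus is the token that starts with '_', species the one that ends with '_'
    genus = next(t for t in tokens if t.startswith('_')).strip('_')
    species = next(t for t in tokens if t.endswith('_')).strip('_')
    # common name is whatever sits inside the parentheses
    match = re.search(r'\(([^)]*)\)', line)
    common = match.group(1) if match else ''
    return code, genus, species, common


if __name__ == '__main__':
    # Made-up sample lines in the assumed layout, not real file contents
    sample = [
        'PTRMIT _Pan troglodytes_ (chimpanzee)',
        'HSAMIT _Homo sapiens_ (human)',
    ]
    for record in sorted(parse_entry(line) for line in sample):
        print(record)
```

Sorting the tuples orders the records by code name first, since that is the first field. If an entry has more than two words between the underscores (a subspecies, say), this sketch keeps only the first and last words.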