Sunday, November 28, 2010

Plotting the transition bias in evolution

I saw this graphic in Page & Holmes and thought it was really nice (kind of reminds me of the old letter counts script in the analysis of Ulysses (here), and I thought it would be fun to try to recreate it. Today we're going to do the left-hand plot, the one that originates from the alignment of the human and chimpanzee mitochondrial DNA sequences.

First we need the sequences of the human and chimpanzee mitochondrial genomes. Go to the NCBI page and get: Homo sapiens (NC_012920 - 16569 nt) and Pan troglodytes (NC_001643 - 16554 nt).

For reasons that surpass understanding, the two are not in synch. This can be seen from a pairwise BLAST, or by looking at the gene layouts after following the links from the main mitochondrial genomes page:

Homo    ND1     3307-4262
Pan ND1 2725-3681

Homo looks like it has about 582 nt extra on the front side, which matches reasonably well with the extra '-' inserted to the Pan sequence observed from a preliminary alignment run with MUSCLE. I saved that run of '-' to a file and counted them:

>>> FH = open('extra.txt','r')
>>> data =
>>> data.count('-')

So then I looked at the very beginning of the Pan sequence in the alignment, and searched for the same 10-mer sequence (GTTTATGTAG) in human at about the right place. To check, I pasted the preceeding human sequence into a file and measured its length:

>>> FH = open('extra.txt','r')
>>> data =
>>> data = data.replace('\n','')
>>> data
>>> len(data)

Looks good. Just cut that from the front of human and paste it at the end of the original sequence. Then rerun the alignment:

$ muscle -in homo.pan.mito.txt -out align.txt

No extra '-' to be seen on the front or back of the sequence after alignment. Overall, there are a few gaps, not many.

>>> FH = open('align.txt','r')
>>> data =
>>> homo,pan = data.split('>Pan')
>>> homo.count('-')
>>> pan.count('-')

The next step is to count the number and kind of the differences. We save a count for each pair in a dict. In the run shown here, I consolidated 'A','C' and 'C','A' (for example) as one entry. However, I used the actual values (that is A-C different than C-A) for the plot.

('-', 'A') 6
('-', 'C') 9
('-', 'G') 2
('-', 'N') 1
('-', 'T') 9
('A', 'A') 4871
('A', 'C') 63
('A', 'G') 420
('A', 'T') 47
('C', 'C') 4645
('C', 'G') 14
('C', 'T') 904
('G', 'G') 1930
('G', 'T') 6
('T', 'T') 3648
16575 total
15094 identical
1324 transition
130 transversion
27 gap

I printed a few of the transitions and transversions as a sanity check.

The plot uses scatter as we've done many times before, and I added the labels by hand. The values are log-scaled to make the smaller ones more visible. The file is on Dropbox (here).

We don't really match the original very well. It shows either C-T or T-C as larger than C-C. The plot doesn't show that, and neither do the actual counts. I don't think it can be a problem with the alignment, since the sequences are so close. Guess I'll have to track down the original reference to know why.