Python for Bioinformatics: Timing complementary DNA

Monday, November 30, 2009

Timing complementary DNA

I received an email the other day from someone in Tunisia, letting me know that he is also posting about Python and Bioinformatics (example). This particular post is about the very simple problem of generating the "bottom" strand from the "top" strand, i.e. substituting each complementary base in a list or a string representatation of a DNA sequence. There are several ways to do this, and I decided to test them for speed, recognizing of course that "premature optimization is the root of all evil" (Knuth is cited here but not referenced).

I used the standard Python module timeit. The class initializer takes two strings of "stmt" or code, and "setup", and has a method of the same name that takes the number of repetitions as an argument.

I tested four approaches:

• if ... elif ... elif ... elif
• a list comprehension with a dictionary lookup
• sequential string replacement dna = dna.replace(n1,n2)
• use of the string module method maketrans

The results for 10,000 reps were as follows:

method    time  ratio
if elif  5.427  108.9
dict     2.026   40.6
replace  0.463    9.3
trans     0.05    1.0

import timeit

s = '''
import string,random
L = list('ACGT')*250
L2 = list()
random.shuffle(L)
D = {'A':'T','C':'G','G':'C','T':'A'}
dna = ''.join(L)
'''

c1 = '''
for nt in L:
    if nt == 'A':  L2.append('T')
    elif nt == 'C':  L2.append('G')
    elif nt == 'G':  L2.append('C')
    elif nt == 'T':  L2.append('A')
'''

c2 = '''
L2 = [D[nt] for nt in L]
'''

c3 = '''
dna = dna.replace('A','T')
dna = dna.replace('C','G')
dna = dna.replace('G','C')
dna = dna.replace('T','A')
'''

tt = '''
tt = string.maketrans('ACGT','TGCA')
'''

c4 = '''
string.translate(dna,tt)
'''

results = list()
for i,c in enumerate([c1,c2,c3,c4]):
    if i == 3:  s = s + tt
    t = timeit.Timer(c,s)
    results.append(t.timeit(int(1E4)))

m = min(results)
for r in results:
    print round(r,3), 
    print str(round(r/m,1)).rjust(6)