We have a single function in a C module that will be compiled into a shared dynamic library. Then we load that library from Python, using ctypes, and call the function. The heavy lifting is done by the function
score
. The arguments are:• the DNA sequence as a string (a str on the Python side)
• an array of floats which holds the scoring table
(in order acgt for each position in the site)
• the length of the site (n = 2 in the example)
• an array to hold the result for each position in the sequence
other variables:
• loop counters / index variables i, j, k
• the current character, c, so we can print it
• the current score for this char, d, also so we can print it
• the current score for this position in the sequence, r
The logic is straightforward and elementary. (Although perhaps there are faster approaches that people might suggest?)
mylib.c
We build the shared library using clang (or gcc, a quick test showed no speed difference):
The Python script is a little more unusual. The array initialization was just adapted from the docs (here), I don't yet understand how it works. (It seems like a pointer would be better).
Here is the second of those:
And I was surprised that we don't have to worry about the types for arguments to our score function---that's why those lines are commented out at the top.
For this toy example, the DNA sequence is 14 nt, and the scoring table is given in the list L. Here is the script:
script.py
Here's the output. Each position is evaluated on a separate line. The last six values printed are the results received back in Python.
Not bad for 50 lines of code and a little more than an hour of work! If you want to see it go fast, comment out the
printf
statements in the C code, and change:dna = 'actgtcgactcgag'*400000
Now the DNA length is 5.6E6, and it finishes in about 5 seconds. It's impressive.