Python for Bioinformatics: Python mergesort

Friday, April 27, 2012

Python mergesort

I need to make a serious commitment to algorithms at some point. I have the 3rd edition of Sedgewick, and worked through volume 1 some years ago. I see the 4th edition has moved to Java; I think that's a big mistake. I would hope for a basic algorithms book in Python, but I'll take C++ over Java any day. (I also have Cormen but put it aside).

I should really buy Knuth's books, but worry they'd be beyond my abilities.

I also have Magnus Hetland's book and Zelle's also, and I think they're great, but haven't found the time to be systematic about either of them either. They are for this summer.

Just for fun, here is a modified version of mergesort (from my book).

There are a couple of ideas here. First, suppose we have two lists that are already sorted. It's easy to merge them into one sorted list, by comparing the next elements in each and choosing the correct one. That's what merge does. The second idea is recursion. That's what mergesort does. The code below is set up to read a power of ten from the input.

This is a good challenge for a beginning Python coder.

Here is some timing on mine:

> time python merge.py 3

real 0m0.054s
user 0m0.039s
sys 0m0.013s
> time python merge.py 4

real 0m0.232s
user 0m0.216s
sys 0m0.014s
> time python merge.py 5

real 0m8.201s
user 0m8.178s
sys 0m0.020s
>

It hangs with N = 1e6, I am not sure why yet.

merge.py

import random, sys

def merge(left,right):
    rL = list()
    while True:
        if left and right:
            if left[0] < right[0]:
                rL.append(left.pop(0))
            else:
                rL.append(right.pop(0))
            continue
        if left:
            rL.extend(left)
        elif right:
            rL.extend(right)
        return rL

def merge_sort(iL):
    if len(iL) == 1:
        return iL
    n = len(iL)/2
    left = merge_sort(iL[:n])
    right = merge_sort(iL[n:])
    result = merge(left,right)
    return result

if __name__ == '__main__':
    try:
        n = int(sys.argv[1])
    except:
        n = 4
    N = 10**n
    L = range(N)
    random.shuffle(L)
    result = merge_sort(L)
    assert result == sorted(L)