Python for Bioinformatics: Chunks of sequence in Python

Saturday, January 23, 2010

I achieved enlightenment in Python this morning. It's quite exciting :)

I asked a question on Stack Overflow about generating "chunks" of a sequence, which is an everyday task in bioinformatics: splitting a DNA sequence into groups of 3 nt, one group per codon. I do this in a simple way (as discussed here), and one of the answers to the question (look for "chunks") uses the same method but wraps it in a generator.
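
For reference, the slicing-plus-generator idea can be sketched roughly as follows. The name `chunks` and the exact form are my own choices for illustration, not necessarily what the Stack Overflow answer used, and the sketch assumes the sequence supports `len()` and slicing, as a string does. It is written for Python 3.

```python
def chunks(seq, n):
    """Yield successive n-sized slices of seq; the last one may be shorter."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


dna = "ATGGCCTAA"
codons = list(chunks(dna, 3))
print(codons)  # ['ATG', 'GCC', 'TAA']
```

Unlike the itertools recipe discussed next, this version does not pad a short final chunk.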

However, many (most?) people prefer this solution (from the recipes section of the itertools docs):

from itertools import izip_longest  # Python 2; this is zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

The docs for izip_longest (renamed zip_longest in Python 3) show this:

def izip_longest(*args, **kwds):
    # izip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
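
If you're on Python 3, a port of the recipe might look like the sketch below. The only change I'm assuming is swapping izip_longest for itertools.zip_longest; the argument order (n first) follows the version quoted above. The DNA string is made up for the example.

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n  # n references to one iterator
    return zip_longest(*args, fillvalue=fillvalue)


# zip_longest pads the shorter input with fillvalue
pairs = ["".join(t) for t in zip_longest("ABCD", "xy", fillvalue="-")]
print(pairs)  # ['Ax', 'By', 'C-', 'D-']

# grouping a DNA sequence into codons; the incomplete last group is padded
codons = ["".join(t) for t in grouper(3, "ATGGCCTA", "N")]
print(codons)  # ['ATG', 'GCC', 'TAN']
```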

I've been trying to understand how this works. On the first line of grouper we have:

args = [iter(iterable)] * n

>>> iter('abc')
<iterator object at 0x10048b890>

So we're just making an iterator out of the iterable thing (string, list, tuple, set ...), and putting n references to that one object into a list. That's the key: the list doesn't hold n independent iterators, it holds the same iterator object n times.

>>> for obj in [iter('abc')]*3:
...     print obj
... 
<iterator object at 0x10048b890>
<iterator object at 0x10048b890>
<iterator object at 0x10048b890>
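
Another way to convince yourself, sketched here in Python 3: check identity with `is`, then advance the iterator through one list slot and read it through another. The variable names are just for illustration.

```python
it = iter("abc")
refs = [it] * 3  # three references, one underlying iterator

print(all(r is it for r in refs))  # True
print(len({id(r) for r in refs}))  # 1

first = next(refs[0])   # advance via one reference ...
second = next(refs[2])  # ... and a different slot sees the new position
print(first, second)  # a b
```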

The *args part is just Python-speak for passing a list of arguments (whose length isn't fixed in advance) into a function: at the call site, the * "unpacks" the list into separate positional arguments. You can't look at what the * does on its own, only as part of a function call:

>>> *[iter('abc')]*3
  File "<stdin>", line 1
    *[iter('abc')]*3
    ^
SyntaxError: invalid syntax

>>> def popper(*args):
...     for e in args:
...         print e.next()
... 
>>> popper(*[iter('abc')]*3)
a
b
c
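
The session above is Python 2. A Python 3 version of the same experiment might look like this sketch, where `e.next()` becomes `next(e)` and, as my own change, popper returns a list instead of printing so the result is easier to compare. The second call uses three independent iterators for contrast.

```python
def popper(*args):
    """Take one item from each argument and return them as a list."""
    return [next(e) for e in args]


shared = popper(*[iter("abc")] * 3)
print(shared)  # ['a', 'b', 'c']

# Three independent iterators, for contrast: each one starts at 'a'
separate = popper(*[iter("abc") for _ in range(3)])
print(separate)  # ['a', 'a', 'a']
```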

So consider what happens when we feed izip_longest a list of references to the same iterator object and it tries to do its thing:

izip_longest(*args)

To make the first "group", it starts by grabbing the first element in our <iterator object at 0x10048b890>, giving 'a', plus the first element from <iterator object at 0x10048b890>, which now gives 'b', plus the first element from <iterator object at 0x10048b890>, which gives 'c'.

Popping the "first" element from the same iterator object a second time gives the second element, because the first is already gone.
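
To make that walk-through concrete, here is a hand-rolled sketch (Python 3) of what I understand zip_longest to be doing with our list. It is not the library's actual implementation, and `trace_groups` is a name invented for this example. Each round takes one item per argument, and since every argument is the same iterator, each round consumes n consecutive items.

```python
from itertools import zip_longest


def trace_groups(n, iterable, fillvalue=None):
    """Hand-rolled sketch of zip_longest(*[iter(iterable)] * n)."""
    it = iter(iterable)
    args = [it] * n  # the same iterator, n times
    groups = []
    while True:
        group = []
        exhausted = 0
        for a in args:
            try:
                group.append(next(a))  # each call moves the shared iterator on
            except StopIteration:
                exhausted += 1
                group.append(fillvalue)
        if exhausted == n:  # nothing left at all: stop
            break
        groups.append(tuple(group))
    return groups


result = trace_groups(3, "abcdefg", "x")
print(result)  # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')]

# Same answer as the real thing
print(result == list(zip_longest(*[iter("abcdefg")] * 3, fillvalue="x")))  # True
```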

Don't know how clear that is to you, but it makes sense to me now. The key is multiple references to the same iterator object.