Python for Bioinformatics: Grouping list elements

Wednesday, December 2, 2009

Grouping list elements

I was browsing Stack Overflow again. I really like the site, and I'm using it to improve my knowledge of Python. The rating system for questions and answers (and answerers) is really helpful.

But sometimes I have to wonder... For example, here is a question from yesterday about how to group elements from a sequence (and since it's from yesterday, it is naturally a duplicate). The answer that people like has me scratching my head. Here is how I do it:

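from string import uppercase as u   # Python 2: the string 'ABC...Z'
L = list(u)

def groupby(L,N=3):
    # start index of each group: 0, N, 2N, ...
    R = range(0,len(L),N)
    # slice out N items per group; the last group may be shorter
    return [L[i:i+N] for i in R]

L2 = groupby(L)
print L2[:1]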

# prints:   [['A', 'B', 'C']]

I like this because: (a) it works, and (b) it's simple enough to understand at a glance. For very long sequences you might want to use xrange instead of range, so that the list of start indices is never built in memory. Here is the popular answer:

>>> from itertools import izip_longest
>>> L=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> list(izip_longest(*[iter(L)]*3))
[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, None)]
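
For a side-by-side look, here is a small sketch that runs both approaches on the same 11-item list. The helper name chunks is my own (hypothetical), and the import fallback is only there so the snippet also runs on Python 3, where izip_longest is called zip_longest.

```python
try:
    from itertools import izip_longest                   # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest    # Python 3

def chunks(seq, n=3):
    # hypothetical helper: the same slicing idea as groupby() above
    return [seq[i:i + n] for i in range(0, len(seq), n)]

L = list(range(1, 12))    # 1..11, as in the session above

by_slicing = chunks(L)
by_zipping = list(izip_longest(*[iter(L)] * 3))

print(by_slicing[-1])     # [10, 11]: the last group is simply shorter
print(by_zipping[-1])     # (10, 11, None): the last group is padded
```

So the two are not quite interchangeable: slicing returns lists and leaves the final group short, while the izip_longest version returns tuples and pads the final one with None.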

The izip_longest one-liner is just crazy. It is not at all clear what it does or how it does it, so in real code it would need a comment. So how does it work? The itertools docs cover izip_longest. The first step makes a list that repeats a single iter(L) object (note the identical addresses):

>>> [iter(L)]*2
[<listiterator object at 0x10048b810>, <listiterator object at 0x10048b810>]
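
The identical addresses are the detail to notice: the list holds one iterator twice, not two separate iterators. A minimal sketch to check this (the names its, first and second are mine, chosen for illustration):

```python
L = [1, 2, 3, 4, 5, 6]
its = [iter(L)] * 2          # one iterator object, referenced twice

same_object = its[0] is its[1]

# Advancing through either reference consumes the one shared iterator,
# so the second call picks up where the first left off.
first = next(its[0])
second = next(its[1])

print(same_object)           # True
print([first, second])       # [1, 2]
```
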

But when I try to unpack it with *, I get:

>>> *[iter(L)]*3
  File "<stdin>", line 1
    *[iter(L)]*3
    ^
SyntaxError: invalid syntax

So the bottom line is that I can't take it apart at the interactive prompt (the * unpacking is only legal inside a function call), and I still don't really see how it works.

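What izip_longest does is take several iterables as separate arguments and pull the next item from each one in turn to build the groups, padding with a fill value (None by default) once the shorter ones run out. From the docs: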

izip_longest('ABCD', 'xy', fillvalue='-') 
--> Ax By C- D-
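
Combining the docs example with the repeated iterator, here is one hedged reading of the one-liner, taken apart step by step. Since * only works inside a call, izip_longest(*[it]*3) amounts to izip_longest(it, it, it), and each output tuple then pulls three consecutive items from the single shared iterator. The loop version (group_padded, a name I made up) is a sketch of that behaviour, not the actual itertools implementation.

```python
try:
    from itertools import izip_longest                   # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest    # Python 3

L = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Make one iterator, then pass it three times explicitly.
it = iter(L)
explicit = list(izip_longest(it, it, it))

# The original one-liner, for comparison.
one_liner = list(izip_longest(*[iter(L)] * 3))

def group_padded(seq, n=3, fillvalue=None):
    # hypothetical hand-rolled equivalent of the one-liner
    it = iter(seq)
    groups = []
    while True:
        group = []
        for _ in range(n):
            try:
                group.append(next(it))
            except StopIteration:
                break
        if not group:                 # nothing left at all: done
            return groups
        # pad a short final group up to n items
        group.extend([fillvalue] * (n - len(group)))
        groups.append(tuple(group))

print(explicit == one_liner == group_padded(L))   # True
```
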

It seems like a lot of extra work is being done here, to go along with the obfuscation. These guys should go back to Perl!