Python for Bioinformatics: Unicode is weird fun!

Thursday, May 3, 2012

I came across something fun in an SO answer the other day:

˙sɯǝlqoɹd ʎuɐ pɐɥ ɹǝʌǝu ǝʌ,I puɐ uʍop ǝpısdn pǝʇunoɯ ǝɹɐ sǝɥɔʇıʍs ʞɹoʍʇǝu ʎɯ ɟo ll∀

That's pretty entertaining! I pasted it into the Python interpreter:

>>> s = '˙sɯǝlqoɹd ʎuɐ pɐɥ ɹǝʌǝu ǝʌ,I puɐ uʍop ǝpısdn pǝʇunoɯ ǝɹɐ sǝɥɔʇıʍs ʞɹoʍʇǝu ʎɯ ɟo ll∀'
>>> s
'\xcb\x99s\xc9\xaf\xc7\x9dlqo\xc9\xb9d \xca\x8eu\xc9\x90 p\xc9\x90\xc9\xa5 \xc9\xb9\xc7\x9d\xca\x8c\xc7\x9du \xc7\x9d\xca\x8c,I pu\xc9\x90 u\xca\x8dop \xc7\x9dp\xc4\xb1sdn p\xc7\x9d\xca\x87uno\xc9\xaf \xc7\x9d\xc9\xb9\xc9\x90 s\xc7\x9d\xc9\xa5\xc9\x94\xca\x87\xc4\xb1\xca\x8ds \xca\x9e\xc9\xb9o\xca\x8d\xca\x87\xc7\x9du \xca\x8e\xc9\xaf \xc9\x9fo ll\xe2\x88\x80'

According to this, the flipped version of 'a' (0x61) is '\u0250', which in UTF-8 is the two bytes '\xc9\x90'. (There are a bunch of other mappings.) You can see these two bytes in the hex data above, for example right after ' p': '\xc9\x90'.
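
To make the mapping idea concrete, here is a minimal sketch in Python 3 (the sessions in this post are Python 2). FLIP and flip() are my own hypothetical names, not the flipTable from the reference, and the table holds only pairs that can be read off the flipped sentence above, so anything else passes through unchanged.

```python
# Partial, hand-built table (an assumption for illustration, not the real flipTable).
FLIP = {
    'a': '\u0250',  # LATIN SMALL LETTER TURNED A
    'c': '\u0254', 'e': '\u01dd', 'f': '\u025f', 'h': '\u0265',
    'i': '\u0131', 'k': '\u029e', 'm': '\u026f', 'r': '\u0279',
    't': '\u0287', 'v': '\u028c', 'w': '\u028d', 'y': '\u028e',
    'A': '\u2200',  # FOR ALL
    '.': '\u02d9',  # DOT ABOVE
    "'": ',',
    # a 180 degree rotation swaps these ASCII letters with each other
    'n': 'u', 'u': 'n', 'd': 'p', 'p': 'd', 'b': 'q', 'q': 'b',
}

def flip(text):
    """Reverse the text and swap each character for its flipped look-alike."""
    return ''.join(FLIP.get(ch, ch) for ch in reversed(text))

flipped = flip('never had any problems.')
# ascii() shows the escapes; print(flipped) shows the glyphs on a UTF-8 terminal
print(ascii(flipped))
```

Reversing the string matters as much as the substitution: rotating a line of text by 180 degrees also reverses the reading order.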

But it's going to take some 'splainin'.

This is a great place to start. The thing about Unicode is that it is fundamentally just a listing of characters and the integer values ("code points") assigned to them. Just take all the interesting symbols you can find, organize them, and then assign an integer to one after another. That's not hard to understand. What's hard is when you get into representing these integers in multi-byte encodings (e.g. UTF-8) and, of course, in printing them.
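
As a small illustration of "one integer, several byte representations", here is a sketch in Python 3 (an assumption on my part, since the rest of this post uses Python 2) that encodes the same code point three ways. The variable names are just for this example.

```python
c = '\u0250'          # LATIN SMALL LETTER TURNED A
code_point = ord(c)   # the integer that Unicode assigns to the character

# byte length of the same character under three encodings
sizes = {}
for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
    encoded = c.encode(codec)
    sizes[codec] = len(encoded)
    print(codec, len(encoded), encoded.hex())
```

The code point stays 592 throughout; only the bytes used to store it change.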

We can get some info about this in Python using the unicodedata module.

>>> import unicodedata
>>> i = 592
>>> hex(i)
'0x250'
>>> c = unichr(i)
>>> print c
ɐ
>>> c.encode('utf8')
'\xc9\x90'
>>> '\xc9\x90'.decode('utf8')
u'\u0250'
>>> unicodedata.name(c)
'LATIN SMALL LETTER TURNED A'
>>>
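
For anyone on Python 3, where unichr() no longer exists and every str is Unicode, the same round trip looks roughly like this. This is a sketch, not a transcript of a session I ran for this post.

```python
import unicodedata

i = 592
c = chr(i)                    # Python 3 spelling of unichr(i)
raw = c.encode('utf-8')       # a bytes object holding the two UTF-8 bytes
back = raw.decode('utf-8')    # decoding gives the original str back
name = unicodedata.name(c)

print(hex(i), raw, ascii(back), name)

# unicodedata.lookup goes the other way, from a name to the character
same = unicodedata.lookup(name)
```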

The Unicode "value" 0x250 is 2*16*16 + 5*16 + 0 = 512 + 80 = 592.
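
The same kind of arithmetic explains where the UTF-8 bytes come from. Code points from U+0080 to U+07FF use the two-byte pattern 110xxxxx 10xxxxxx: the top five bits go in the first byte and the low six bits in the second. Here is a minimal Python 3 sketch; utf8_two_byte is a hypothetical helper that only handles that one range.

```python
def utf8_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError('only the two-byte range is handled in this sketch')
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx: the top five bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx: the low six bits
    return bytes([lead, trail])

encoded = utf8_two_byte(592)
print(encoded.hex())   # c990
```

For 592, 592 >> 6 is 9 and 592 & 0x3F is 16, so the bytes are 0xC0 + 9 = 0xC9 and 0x80 + 16 = 0x90.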

What makes it complicated is the gyrations involved in representing this in various other formats, like UTF-8 or HTML ('&' + '#592;'). I grabbed the defs for the flipTable from the reference above.
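
The HTML form is easy to produce and undo with the standard library. A small sketch, again assuming Python 3:

```python
import html

i = 592
entity = '&#%d;' % i          # decimal numeric character reference
hex_entity = '&#x%x;' % i     # hexadecimal form of the same reference

print(entity, hex_entity)     # &#592; &#x250;

# html.unescape turns either form back into the character
decoded = html.unescape(entity)

# encoding to ASCII with 'xmlcharrefreplace' produces the decimal form
as_bytes = chr(i).encode('ascii', 'xmlcharrefreplace')
```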

Then I went exploring in Python (see the script for details):

> python script.py 
a
from flipTable:
ɐ    592 \u0250 LATIN SMALL LETTER TURNED A
from unicodedata:
ɐ    (592, '0x250', u'\u0250', 'LATIN SMALL LETTER TURNED A')
ᵄ    (7492, '0x1d44', u'\u1d44', 'MODIFIER LETTER SMALL TURNED A')
..
e
from flipTable:
ǝ    477 \u01DD LATIN SMALL LETTER TURNED E
from unicodedata:
ǝ    (477, '0x1dd', u'\u01dd', 'LATIN SMALL LETTER TURNED E')
ᴈ    (7432, '0x1d08', u'\u1d08', 'LATIN SMALL LETTER TURNED OPEN E')
ᵌ    (7500, '0x1d4c', u'\u1d4c', 'MODIFIER LETTER SMALL TURNED OPEN E')
ⱻ    (11387, '0x2c7b', u'\u2c7b', 'LATIN LETTER SMALL CAPITAL TURNED E')

Many of the "flips" are silly ('q' for 'b' and so on), but lots of them are absolutely right. Finally, my Python is a narrow build, so unichr() doesn't know about anything i >= 2**16.

>>> i = 2**16
>>> i
65536
>>> unichr(i)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> i -= 1
>>> unichr(i)
u'\uffff'
>>> print unichr(i)

>>>
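
A narrow build stores unicode strings as 16-bit units, so a code point above 0xFFFF has to be written as a pair of UTF-16 surrogates, which is why unichr() refuses it. The arithmetic behind the pair can be sketched in Python 3 (where chr() accepts the full range); surrogate_pair is a hypothetical helper name.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

high, low = surrogate_pair(2**16)
print(hex(high), hex(low))   # 0xd800 0xdc00

# the pair matches what the UTF-16 codec produces for the same code point
utf16 = chr(2**16).encode('utf-16-be')
```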

script.py:

from collections import defaultdict
import unicodedata
from flip import flipTable

letters = 'abcdefghijklmnopqrstuvwxyz'
# turned characters found via unicodedata, keyed by the letter they flip
D = defaultdict(list)

def caps(s):
    # upper-case the hex digits, e.g. '6a' -> '6A', to build flipTable keys
    return s.upper()

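# scan everything below 2**16; a narrow build cannot go past 0xFFFF
for i in range(1, 2**16):
    u = unichr(i)
    try:
        n = unicodedata.name(u)
    except ValueError:
        # code points without a name (unassigned, control characters)
        continue
    v = n.lower()
    if 'turned' in v:
        # the last word of the name is the base letter, e.g. '... TURNED A'
        k = v.split()[-1]
        if len(k) > 1:
            continue
        if 'small' not in v:
            k = k.upper()
        D[k].append((i, hex(i), u, n))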

for s in letters:
    print s
    # flipTable key as literal text: backslash, 'u', four hex digits
    c = '\\u00' + caps(hex(ord(s))[2:])
    try:
        v = flipTable[c]
        # v is also literal text such as '\u0250'; drop the '\u' prefix
        i = int(v[2:], 16)
        u = unichr(i)
        print 'from flipTable:'
        print u.ljust(4), i, v,
        print unicodedata.name(u)
    except KeyError:
        pass
    if D[s]:
        print 'from unicodedata:'
    for e in D[s]:
        print e[2].ljust(4), e
    print '-' * 10
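
The script above is Python 2. As a rough Python 3 sketch of just the scanning step (find_turned is a hypothetical name, and the results depend on the Unicode version bundled with your Python), the search can cover the whole code point range instead of stopping at 2**16:

```python
from collections import defaultdict
import sys
import unicodedata

def find_turned(limit=sys.maxunicode + 1):
    """Group code points named '... TURNED <single letter>' by that letter."""
    found = defaultdict(list)
    for i in range(1, limit):
        # passing a default avoids the ValueError for unnamed code points
        name = unicodedata.name(chr(i), None)
        if name is None or 'TURNED' not in name:
            continue
        key = name.split()[-1]
        if len(key) > 1:
            continue
        if 'SMALL' in name:
            key = key.lower()
        found[key].append((i, hex(i), name))
    return found

found = find_turned()
for i, h, name in found['a']:
    print(i, h, name)
```

The tuples leave out the character itself so that the output stays ASCII; add chr(i) back in if your terminal handles UTF-8.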