Python for Bioinformatics: UTF-16 endian-ness

Friday, April 27, 2012

UTF-16 endian-ness

This is a brief continuation of the post on hexdump. I saved the same data in different files as UTF-16 in both big-endian and little-endian formats using TextMate.

> hexdump -C text16B.txt 
00000000  fe ff 00 54 00 68 00 69  00 73 00 20 00 69 00 73  |...T.h.i.s. .i.s|
00000010  00 20 00 73 00 6f 00 6d  00 65 00 20 00 74 00 65  |. .s.o.m.e. .t.e|
00000020  00 78 00 74 00 0a 00 2d  00 2d 00 e4 00 2d 00 2d  |.x.t...-.-...-.-|
00000030  00 3a 00 20 00 20 00 61  00 20 00 77 00 69 00 74  |.:. . .a. .w.i.t|
00000040  00 68 00 20 00 75 00 6d  00 6c 00 61 00 75 00 74  |.h. .u.m.l.a.u.t|
00000050  00 0a 00 2d 00 2d 22 1a  00 2d 00 2d 00 3a 00 20  |...-.-"..-.-.:. |
00000060  00 20 00 73 00 71 00 72  00 74                    |. .s.q.r.t|
0000006a


> hexdump -C text16L.txt 
00000000  ff fe 54 00 68 00 69 00  73 00 20 00 69 00 73 00  |..T.h.i.s. .i.s.|
00000010  20 00 73 00 6f 00 6d 00  65 00 20 00 74 00 65 00  | .s.o.m.e. .t.e.|
00000020  78 00 74 00 0a 00 2d 00  2d 00 e4 00 2d 00 2d 00  |x.t...-.-...-.-.|
00000030  3a 00 20 00 20 00 61 00  20 00 77 00 69 00 74 00  |:. . .a. .w.i.t.|
00000040  68 00 20 00 75 00 6d 00  6c 00 61 00 75 00 74 00  |h. .u.m.l.a.u.t.|
00000050  0a 00 2d 00 2d 00 1a 22  2d 00 2d 00 3a 00 20 00  |..-.-.."-.-.:. .|
00000060  20 00 73 00 71 00 72 00  74 00                    | .s.q.r.t.|
0000006a

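The first two bytes are a byte order mark (BOM): fe ff in the big-endian file and ff fe in the little-endian one. The BOM is the code point U+FEFF written in the file's own byte order, so a reader can tell from those two bytes which way around the rest of the data is.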
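Here is a minimal sketch of how a program might use the BOM, assuming Python 3. The function name sniff_utf16 is hypothetical (made up for this example); the codecs constants are from the standard library, and the sample bytes are just the first six bytes of each dump above.

```python
import codecs

BOM = "\ufeff"  # the byte order mark is the code point U+FEFF

# Encoding U+FEFF in each byte order gives the two-byte signatures seen above.
assert BOM.encode("utf-16-be") == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert BOM.encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"


def sniff_utf16(data):
    """Guess the byte order of UTF-16 data from its first two bytes.

    Hypothetical helper for illustration; returns None if there is no BOM.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


head_b = bytes.fromhex("feff00540068")  # first six bytes of text16B.txt
head_l = bytes.fromhex("fffe54006800")  # first six bytes of text16L.txt

print(sniff_utf16(head_b))  # utf-16-be
print(sniff_utf16(head_l))  # utf-16-le

# Python's generic "utf-16" codec reads the BOM itself and strips it.
print(head_b.decode("utf-16"), head_l.decode("utf-16"))  # Th Th
```

In practice you rarely need to sniff by hand, since decoding with "utf-16" does it for you, but the helper shows what is going on.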

Notice how the bytes for our example code points have changed from last time. The ä ('LATIN SMALL LETTER A WITH DIAERESIS', U+00E4), which is c3 a4 in UTF-8, is 00 e4 in big-endian UTF-16 and e4 00 in little-endian.

The √ (U+221A) is e2 88 9a in UTF-8, 22 1a in big-endian UTF-16, and 1a 22 in little-endian.
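The same bytes can be checked from Python. This is a small sketch assuming Python 3.8 or later (for the separator argument to bytes.hex); the helper name encodings is hypothetical.

```python
def encodings(ch):
    """Return the hex bytes of one character in three encodings.

    Hypothetical helper for illustration. The "-be" and "-le" codecs
    do not write a BOM, so only the character's own bytes appear.
    """
    return {
        name: ch.encode(name).hex(" ")
        for name in ("utf-8", "utf-16-be", "utf-16-le")
    }


for ch in ("\u00e4", "\u221a"):  # a with diaeresis, square root sign
    print(f"U+{ord(ch):04X}", encodings(ch))

# U+00E4 {'utf-8': 'c3 a4', 'utf-16-be': '00 e4', 'utf-16-le': 'e4 00'}
# U+221A {'utf-8': 'e2 88 9a', 'utf-16-be': '22 1a', 'utf-16-le': '1a 22'}
```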

The most important point is that the bytes are completely different depending on the encoding (UTF-8 versus UTF-16). When the encoding used to write the data and the one used to read it don't match, that mismatch is usually the source of the weird or unexpected stuff you see printed on the screen.
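A quick way to see a mismatch in action, again assuming Python 3. The variable names are just for this example, and Latin-1 is used as a stand-in for "some wrong encoding"; it is not something from the files above.

```python
import codecs

text = "--\u00e4--"  # the umlaut line marker from the sample file

# Written as UTF-8 but read as Latin-1: every byte becomes one character,
# so the two bytes c3 a4 turn into two junk characters.
raw = text.encode("utf-8")
wrong = raw.decode("latin-1")
print(ascii(wrong))  # '--\xc3\xa4--', which displays as --Ã¤--

# Written as big-endian UTF-16 but read as little-endian: each byte pair
# is swapped, so 00 54 00 68 ("Th") is read as U+5400 U+6800.
swapped = "Th".encode("utf-16-be").decode("utf-16-le")
print(ascii(swapped))  # '\u5400\u6800'

# Written as UTF-16 (with BOM) but read as UTF-8: fe and ff are never
# valid in UTF-8, so here we get an error rather than junk.
data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
try:
    data.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True
```

The silent cases (the first two) are the nasty ones, because nothing tells you the decode went wrong.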

I wish they had come up with some other words to describe the byte order than big- and little-endian. The reference to Gulliver is amusing, but a string of bytes has two ends. Which is supposed to be the big one? It's similar to the situation with genes, where people talk about the 5' and 3' ends, but of course each end of double-stranded DNA has both 5' and 3' ends!

Big-endian is described as "most significant byte first." How about "highest value byte first" or "natural order"?
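To make "most significant byte first" concrete, here is a small sketch (Python 3 assumed) that splits the 16-bit value for √ into its high and low bytes and lays them out both ways. The variable names are only for illustration.

```python
code_point = 0x221A  # the square root sign

high = code_point >> 8    # 0x22, the most significant byte
low = code_point & 0xFF   # 0x1a, the least significant byte

big = code_point.to_bytes(2, "big")        # high byte first
little = code_point.to_bytes(2, "little")  # low byte first

print(big.hex(), little.hex())  # 221a 1a22

# These match what the UTF-16 codecs produce for the same character.
assert big == bytes([high, low]) == "\u221a".encode("utf-16-be")
assert little == bytes([low, high]) == "\u221a".encode("utf-16-le")
```

Big-endian is the order you would write the hex number by hand, which is why "natural order" seems like a reasonable name to me.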