Friday, April 27, 2012

hexdump

This is a quick look at hexdump, with a brief discussion of UTF encodings.

This is some text
--ä--:  a with umlaut
--√--:  sqrt


Paste that into a text editor (like TextMate) and save in UTF-8 encoding (default).

> hexdump -C text8.txt 
00000000  54 68 69 73 20 69 73 20  73 6f 6d 65 20 74 65 78  |This is some tex|
00000010  74 0a 2d 2d c3 a4 2d 2d  3a 20 20 61 20 77 69 74  |t.--..--:  a wit|
00000020  68 20 75 6d 6c 61 75 74  0a 2d 2d e2 88 9a 2d 2d  |h umlaut.--...--|
00000030  3a 20 20 73 71 72 74                              |:  sqrt|
00000037

[ Two other very useful flags are -n (number of bytes to read) and -s (offset, the number of bytes to skip). Flags must precede the filename. ]

The first column is the offset of each row (in hex), the middle two columns are the bytes, eight per column, and the last is the ASCII version of each byte, which we requested using the -C flag. Anything that isn't a printable ASCII character (values below '0x20', and '0x7f' and above) is shown as a "."
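
To make the column layout concrete, here is a rough Python sketch that formats bytes the same way. The function name hexdump_c is made up for this post, and it ignores details of the real tool (such as collapsing repeated rows into a "*"), so treat it as an illustration rather than a replacement.

```python
def hexdump_c(data: bytes) -> str:
    """Format bytes roughly the way `hexdump -C` lays them out."""
    lines = []
    for offset in range(0, len(data), 16):
        row = data[offset:offset + 16]
        # Two groups of up to eight bytes, each byte as two hex digits.
        left = " ".join(f"{b:02x}" for b in row[:8])
        right = " ".join(f"{b:02x}" for b in row[8:])
        hex_part = f"{left}  {right}".ljust(48)   # pad short final rows
        # Printable ASCII is shown as itself, everything else as ".".
        text = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part}  |{text}|")
    lines.append(f"{len(data):08x}")              # final line: total length
    return "\n".join(lines)

# The sample text, with the two non-ASCII characters written as escapes.
sample = "This is some text\n--\u00e4--:  a with umlaut\n--\u221a--:  sqrt".encode("utf-8")
print(hexdump_c(sample))
```

Note the two spaces after each colon in the sample; that is what the '3a 20 20' in the dump corresponds to.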

The output gives us a chance to practice hex arithmetic. For example, the third byte is '0x69', which is "i", and the fourth byte is '0x73', which is "s". The two values differ by ten ('0x0a') because nine letters lie between them:

'0x6a' '0x6b' '0x6c' '0x6d' '0x6e' '0x6f' '0x70' '0x71' '0x72'
"j"    "k"    "l"    "m"    "n"    "o"    "p"    "q"    "r"

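The same arithmetic can be checked in Python with chr and range. The variable names here are just for illustration.

```python
first, last = 0x69, 0x73

# Every code strictly between the two bytes, turned back into a character.
between = [chr(code) for code in range(first + 1, last)]

print(chr(first), chr(last))   # i s
print(last - first)            # 10
print("".join(between))        # jklmnopqr
```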
The a with umlaut is officially named 'LATIN SMALL LETTER A WITH DIAERESIS' (reference). There is a nice Python doc about Unicode, and also Wikipedia, where you can learn about the saga of Klingon.

The UTF-8 encoding of 'LATIN SMALL LETTER A WITH DIAERESIS' is

'0xc3' '0xa4'

and you can see these two bytes in the second line of the printout flanked by the repeated bytes '0x2d' for "--"
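
As a quick cross-check, Python's standard library will report both the character name and its UTF-8 bytes. This is a small sketch using unicodedata.name and str.encode.

```python
import unicodedata

ch = "\u00e4"                       # a with umlaut, written as an escape
encoded = ch.encode("utf-8")

print(unicodedata.name(ch))         # LATIN SMALL LETTER A WITH DIAERESIS
print([hex(b) for b in encoded])    # ['0xc3', '0xa4']
print(hex(ord("-")))                # 0x2d, the flanking hyphen
```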

The way this works is that UTF-8 is a variable-length encoding, in which the leading bits of the first byte say how many bytes the character occupies:

0zzzzzzz                    one-byte character
110yyyyy 10zzzzzz           two-byte character
1110xxxx 10yyyyyy 10zzzzzz  three-byte character
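
The table can be turned into a few lines of Python. This is only a sketch of the three cases shown (utf8_encode is a made-up name); it leaves out four-byte sequences and does no checking for invalid code points, which the built-in str.encode handles properly.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the one-, two- or three-byte layout."""
    if code_point < 0x80:                      # 0zzzzzzz
        return bytes([code_point])
    if code_point < 0x800:                     # 110yyyyy 10zzzzzz
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:                   # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("four-byte sequences are not covered by this sketch")

for cp in (0x54, 0xe4, 0x221a):
    print(hex(cp), utf8_encode(cp).hex())
```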

An ASCII text and its UTF-8 encoding would be identical. Let's grab the bits:

>>> bin(0xc3)
'0b11000011'
>>> bin(0xa4)
'0b10100100'

In bytes then, '0xc3' '0xa4' is:

11000011 10100100

Remove the prefix bits ('110' on the lead byte, which announces a two-byte sequence, and '10' on the continuation byte), leaving:

00011 100100

and padding out, we obtain:

00000000 11100100

which is:

00 e4
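
The strip-and-pad steps are just masks and a shift. Here is the same calculation in Python, as a sketch with illustrative variable names:

```python
lead, cont = 0xc3, 0xa4

payload_hi = lead & 0b00011111    # drop the '110' prefix, keep 5 bits
payload_lo = cont & 0b00111111    # drop the '10' prefix, keep 6 bits
code_point = (payload_hi << 6) | payload_lo

print(format(code_point, "016b"))   # 0000000011100100
print(hex(code_point))              # 0xe4
```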

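The two bytes above are the Unicode code point, U+00E4, which we could just have obtained from the reference page. For code points below U+10000, such as this one, the same two bytes are also the UTF-16 encoding.

Writing the most significant byte first, as here, is what's called big-endian.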
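
Python can show both byte orders side by side. A small sketch, using the "utf-16-be" and "utf-16-le" codec names so that no byte-order mark is added:

```python
ch = "\u00e4"

big = ch.encode("utf-16-be")       # most significant byte first
little = ch.encode("utf-16-le")    # least significant byte first

print(big.hex())      # 00e4
print(little.hex())   # e400
```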

The second non-ASCII character we have is the square-root symbol √ (encoded in this HTML page as ampersand-radic;)

In the output we have:

e2 88 9a

>>> bin(0xe2)
'0b11100010'
>>> bin(0x88)
'0b10001000'
>>> bin(0x9a)
'0b10011010'

In binary, that is:

11100010 10001000 10011010

Stripping out the prefix bits ('1110' on the lead byte, '10' on each of the two continuation bytes):

0010 001000 011010

No padding needed:

00100010 00011010

In hex:

22 1a

which matches the reference: the code point is U+221A (SQUARE ROOT).
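
Both worked examples follow the same recipe, so here is a rough general version in Python. The name utf8_decode_one is invented for this post, and the sketch only covers the one-, two- and three-byte forms; in real code bytes.decode("utf-8") is the thing to use.

```python
def utf8_decode_one(data: bytes) -> int:
    """Return the code point of the first character in `data`."""
    lead = data[0]
    if lead < 0x80:                          # 0zzzzzzz: plain ASCII
        return lead
    if lead >> 5 == 0b110:                   # 110yyyyy: one continuation byte
        n_cont, value = 1, lead & 0b00011111
    elif lead >> 4 == 0b1110:                # 1110xxxx: two continuation bytes
        n_cont, value = 2, lead & 0b00001111
    else:
        raise ValueError("lead byte not covered by this sketch")
    if len(data) < 1 + n_cont:
        raise ValueError("truncated sequence")
    for byte in data[1:1 + n_cont]:
        if byte >> 6 != 0b10:                # every continuation starts with '10'
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value

print(hex(utf8_decode_one(bytes([0xc3, 0xa4]))))         # 0xe4
print(hex(utf8_decode_one(bytes([0xe2, 0x88, 0x9a]))))   # 0x221a
```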

I'll have to put off discussion of UTF encodings in Python to another time.