This is some text
--ä--:  a with umlaut
--√--:  sqrt
Paste that into a text editor (like TextMate) and save in UTF-8 encoding (default).
> hexdump -C text8.txt
00000000  54 68 69 73 20 69 73 20  73 6f 6d 65 20 74 65 78  |This is some tex|
00000010  74 0a 2d 2d c3 a4 2d 2d  3a 20 20 61 20 77 69 74  |t.--..--:  a wit|
00000020  68 20 75 6d 6c 61 75 74  0a 2d 2d e2 88 9a 2d 2d  |h umlaut.--...--|
00000030  3a 20 20 73 71 72 74                              |:  sqrt|
00000037
[ Two other very useful flags are -n (number of bytes to read) and -s (offset: the number of bytes to skip). Flags must precede the filename. ]
The first column is the offset of each row, in hex. The middle two columns are the bytes, and the last is the ASCII version of each, which we requested using the -C flag. Bytes outside the printable ASCII range (below '0x20' or above '0x7e') are shown as a "."
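To make the layout concrete, here is a rough Python sketch that formats bytes in the same style. The names hexdump_c and sample are mine, and this only imitates the output format; it is not how hexdump itself is implemented.

```python
def hexdump_c(data: bytes) -> str:
    """Format bytes roughly the way `hexdump -C` prints them."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        # Two groups of eight hex bytes, each padded to 23 characters.
        left = " ".join(f"{b:02x}" for b in chunk[:8])
        right = " ".join(f"{b:02x}" for b in chunk[8:])
        # Printable ASCII (0x20 to 0x7e) is shown as is, anything else as ".".
        text = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        lines.append(f"{offset:08x}  {left:<23}  {right:<23}  |{text}|")
    # The last line is the total number of bytes, in hex.
    lines.append(f"{len(data):08x}")
    return "\n".join(lines)


# The first 24 bytes of the file above, typed in by hand.
sample = bytes.fromhex(
    "54 68 69 73 20 69 73 20 73 6f 6d 65 20 74 65 78"
    "74 0a 2d 2d c3 a4 2d 2d"
)
print(hexdump_c(sample))
```

The newline ('0x0a') and the two bytes of the umlaut both come out as "." in the right-hand column, just as in the real output.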
The output gives us a chance to practice hex arithmetic. For example, the third byte is '0x69' which is "i" and the fourth byte is '0x73' which is "s". The two are ten positions apart ('0x73' - '0x69' = '0x0a'), because the nine values between them are:
'0x6a' '0x6b' '0x6c' '0x6d' '0x6e' '0x6f' '0x70' '0x71' '0x72'
 "j"    "k"    "l"    "m"    "n"    "o"    "p"    "q"    "r"
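If you would rather let Python do the counting, a quick check along these lines works (the variable name between is just mine):

```python
# Every value strictly between "i" (0x69) and "s" (0x73).
between = [(hex(code), chr(code)) for code in range(0x69 + 1, 0x73)]
for hex_value, letter in between:
    print(hex_value, letter)

# The distance between the two letters, computed two ways.
print(0x73 - 0x69)          # 10
print(ord("s") - ord("i"))  # 10
```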
The a with umlaut is formally named 'LATIN SMALL LETTER A WITH DIAERESIS' (reference). There is a nice Python doc about Unicode, and also Wikipedia, where you can learn about the saga of Klingon.
The UTF-8 encoding of 'LATIN SMALL LETTER A WITH DIAERESIS' is
c3 a4
and you can see these two bytes in the second line of the printout flanked by the repeated bytes '0x2d' for "--"
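The way this works is that UTF-8 is a variable-length encoding. The leading bits of the first byte tell you how many bytes the character uses, and every following byte starts with 10: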
0zzzzzzz                      one-byte character
110yyyyy 10zzzzzz             two-byte character
1110xxxx 10yyyyyy 10zzzzzz    three-byte character
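The three patterns above can be turned into a small hand-rolled encoder. This is only a sketch: utf8_encode is my own name, it stops at three bytes just like the table, and it does not reject the surrogate range the way a real encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, using the one- to three-byte patterns."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # 0zzzzzzz
        return bytes([code_point])
    if code_point < 0x800:
        # 110yyyyy 10zzzzzz
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("four-byte sequences are not covered in this sketch")


print(utf8_encode(0x69).hex())    # 69
print(utf8_encode(0xe4).hex())    # c3a4
print(utf8_encode(0x221a).hex())  # e2889a
```

The last two results are the byte sequences that show up in the hexdump for the umlaut and the square-root sign.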
An ASCII text and its UTF-8 encoding would be identical. Let's grab the bits:
>>> bin(0xc3)
'0b11000011'
>>> bin(0xa4)
'0b10100100'
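In binary then, '0xc3' '0xa4' is:
11000011 10100100
Remove the "meta" bits (the 110 prefix on the lead byte and the 10 prefix on the continuation byte), leaving:
00011 100100
and padding out on the left with zeros to 16 bits, we obtain:
00000000 11100100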
These two bytes, '0x00' '0xe4', are the encoding in UTF-16, which we could just have obtained from the reference page.
This byte order, with the most significant byte first, is what's called big-endian.
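The same steps for the two bytes c3 a4 can be done with bit masks and shifts instead of by eye. This is a sketch with variable names of my own choosing; int.to_bytes is used only to show the two possible byte orders of the 16-bit value.

```python
lead, continuation = 0xc3, 0xa4

# Mask off the 110 prefix (keep 5 bits) and the 10 prefix (keep 6 bits),
# then glue the two payloads together.
code_point = ((lead & 0b00011111) << 6) | (continuation & 0b00111111)

print(hex(code_point))             # 0xe4
print(format(code_point, "016b"))  # 0000000011100100

# The same 16-bit value written out in both byte orders.
big = code_point.to_bytes(2, "big")
little = code_point.to_bytes(2, "little")
print(big.hex(), little.hex())     # 00e4 e400
```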
The second non-ASCII character we have is the square-root symbol √ (encoded in this HTML page as ampersand-radic;)
In the output we have:
e2 88 9a
>>> bin(0xe2)
'0b11100010'
>>> bin(0x88)
'0b10001000'
>>> bin(0x9a)
'0b10011010'
In binary, that is:
11100010 10001000 10011010
Stripping out the meta info:
0010 001000 011010
No padding needed:
0010 0010 0001 1010
which is '0x221a' and matches the reference.
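Putting the stripping steps into a function gives a small decoder for a single character. Again this is a sketch: utf8_decode_one is my own name, it handles only the one- to three-byte forms discussed here, and it does not check for overlong encodings as a proper decoder should.

```python
def utf8_decode_one(seq: bytes) -> int:
    """Recover the code point from one UTF-8 sequence of one to three bytes."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first >> 7 == 0b0:
        expected, value = 1, first
    elif first >> 5 == 0b110:
        expected, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:
        expected, value = 3, first & 0b00001111
    else:
        raise ValueError("lead byte not covered by this sketch")
    if len(seq) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(seq)}")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        # Make room for six more bits and append the payload.
        value = (value << 6) | (byte & 0b00111111)
    return value


sqrt_point = utf8_decode_one(bytes([0xe2, 0x88, 0x9a]))
print(hex(sqrt_point))  # 0x221a
print(chr(sqrt_point))  # √
```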
I'll have to put off discussion of UTF encodings in Python to another time.