Strange .encode() result

meredydd · April 13, 2020, 5:26pm

In a Python string constant, \xa0 means Unicode codepoint #160 (NO-BREAK SPACE). When we encode that codepoint in UTF-8, it takes two bytes. Why?

Well, although the original ASCII encoding only had 128 different characters, Unicode defines well over a hundred thousand characters (and has room for more than a million code points), and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we’d like that stuff to keep working, but we need non-English speakers to be able to use computers too, so we need to be able to represent their characters.

The answer is UTF-8, an ingenious scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte, so text that only uses those characters is completely compatible with ASCII. The next 1920 code points (U+0080 up to U+07FF, including U+00A0, your NO-BREAK SPACE) take two bytes each; they cover many common non-ASCII characters, such as accented Latin letters, Greek, Cyrillic, Hebrew and Arabic. Beyond that, you’re in three bytes for the rest of the Basic Multilingual Plane (U+0800 up to U+FFFF), or four bytes for the “astral planes” (e.g. most emoji).
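To make those ranges concrete, here is a small sketch you can paste into a Python 3 shell. The sample characters are just ones I picked to land in each range, not anything special:

```python
# One sample character from each UTF-8 length class (illustrative picks).
samples = {
    "A": "U+0041, ASCII letter",
    "\xa0": "U+00A0, NO-BREAK SPACE",
    "\u20ac": "U+20AC, EURO SIGN",
    "\U0001F600": "U+1F600, GRINNING FACE emoji",
}

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{description}: {len(encoded)} byte(s) -> {encoded!r}")

# Number of UTF-8 bytes per sample, in the order listed above.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]
```

Each string here is a single code point (`len(ch) == 1`), but the encoded length grows from one to four bytes as the code point number gets larger.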

So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \xa0), we gain the ability to represent every character from every written human language on Earth (even the really weird ones), while still being compatible with ASCII for the common case. I’d say it’s a worthwhile trade.
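If you want to see that trade-off in action, here is a rough sketch. I'm using Latin-1 as the example one-byte encoding; that choice is mine for illustration, and other single-byte encodings would behave similarly for their own character sets:

```python
nbsp = "\xa0"

# A one-byte encoding (Latin-1, chosen for illustration) stores U+00A0 in one byte...
latin1_bytes = nbsp.encode("latin-1")
# ...while UTF-8 needs two.
utf8_bytes = nbsp.encode("utf-8")
print(latin1_bytes, utf8_bytes)  # b'\xa0' b'\xc2\xa0'

# Pure-ASCII text comes out byte-for-byte identical under ASCII and UTF-8.
ascii_same = "hello".encode("ascii") == "hello".encode("utf-8")
print(ascii_same)  # True

# The price of a one-byte encoding: anything outside its 256 values fails.
try:
    "\u20ac".encode("latin-1")  # EURO SIGN is not in Latin-1
    latin1_handles_euro = True
except UnicodeEncodeError:
    latin1_handles_euro = False
print(latin1_handles_euro)  # False
```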

For more reading, try this StackOverflow question: What’s the difference between a [Python] string and a byte string?