Strange .encode() result

If I do this :

a = "\xa0"

then do this :

print(a.encode())

it prints this :

b'\xc2\xa0'

Where is the \xc2 coming from?

That’s UTF-8 encoding!

In a Python string constant, \xa0 means Unicode codepoint #160 (NO-BREAK SPACE). When we encode that codepoint in UTF-8, it takes two bytes. Why?
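You can check this in the interpreter. A minimal sketch using only the standard library (the variable names are just for illustration):

```python
import unicodedata

a = "\xa0"

code_point = ord(a)           # integer code point of the single character
name = unicodedata.name(a)    # official Unicode character name
encoded = a.encode()          # str.encode() defaults to UTF-8

print(code_point, name, encoded)
```

The string holds one character, but its UTF-8 form holds two bytes.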

Well, although the original ASCII encoding only had 128 different characters, there are well over 100,000 assigned Unicode codepoints, and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we’d like that stuff to keep working, but we need non-English-speakers to be able to use computers too, so we need to be able to represent their characters.

The answer is UTF-8, an ingenious scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte – so text that only uses those characters is completely compatible with ASCII. The next 1920 characters, containing the most common non-English characters (U+0080 up to U+07FF, including U+00A0, your NO-BREAK SPACE), are spread across two bytes. Beyond that, you need three bytes for the rest of the Basic Multilingual Plane, or four bytes for the “astral planes” (e.g. emoji).
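To see exactly where the \xc2 comes from, here is a rough sketch that hand-encodes a two-byte code point. The helper name utf8_two_byte is made up for this example; in real code you would just call .encode("utf-8").

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in the range U+0080..U+07FF as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles two-byte code points")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


manual = utf8_two_byte(0xA0)

# Encoded length for one character from each size class:
# ASCII letter, NO-BREAK SPACE, euro sign, an emoji.
lengths = {
    ch: len(ch.encode("utf-8"))
    for ch in ["A", "\xa0", "\u20ac", "\U0001F600"]
}

print(manual, lengths)
```

For U+00A0 the top bits land in the lead byte as 0xC2, and the low six bits land in the trailing byte, which happens to come out as 0xA0 again.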

So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \xa0), we gain the ability to represent every character from every written human language on Earth (even the really weird ones) while still being compatible with ASCII for the common case. I’d say it’s a worthwhile trade :slight_smile:
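As a small illustration of that trade-off, here is a sketch comparing UTF-8 with Latin-1, one example of a one-byte encoding (picked for illustration; the thread doesn't name a specific one):

```python
a = "\xa0"

latin1 = a.encode("latin-1")   # one byte per character, but only U+0000..U+00FF
utf8 = a.encode("utf-8")       # two bytes for this character

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "hello"
same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Latin-1 cannot represent anything above U+00FF, such as the euro sign.
try:
    "\u20ac".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(latin1, utf8, same, euro_fits)
```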

For more reading, try this StackOverflow question: What’s the difference between a [Python] string and a byte string?

Thanks for such a full answer :slight_smile:

Not for me - it’s an ache in the trouser department. And, at the end of the day, I’m the only person who matters :slight_smile:

If you can explain the context of your usage a little more, we might be able to help. For example, if you want the byte rather than the Unicode string, I’d recommend writing it as a byte string: b"\xa0".
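For example, here are a few equivalent ways of getting that single raw byte without going through a Unicode string at all (a quick sketch, standard library only):

```python
import struct

literal = b"\xa0"                # bytes literal: exactly one byte, value 160
from_int = bytes([0xA0])         # built from a list of integers
packed = struct.pack("B", 160)   # packed as one unsigned byte

text = "\xa0"                    # a str: one character, not one byte

print(literal, from_int, packed, text.encode())
```

All three bytes objects are identical, and none of them is equal to the UTF-8 encoding of the string "\xa0".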

I’m just having trouble getting my head around a library that doesn’t work quite as I thought it should. It’s my problem only.

But if you’re bored :

When you add an optional TLV for “source_subaddress”, carriers require the message type, which in this case is \xA0. This then means that the optional part of the PDU is :

20200008A0xxxxxxxxxxxxxx

2020 being the optional TLV tag (source_subaddress), 0008 being the octet length of the subaddress information, A0 being the type of subaddress info (which counts as the first octet of that length), and the last 7 octets being the actual data.
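That layout can be sketched with struct. The tag value is taken as written in this post, and the 7-octet payload is a made-up placeholder, not real subaddress data:

```python
import struct

SOURCE_SUBADDRESS_TAG = 0x2020   # tag value as written in the post
SUBADDRESS_TYPE = 0xA0           # type octet the carrier expects

payload = b"1234567"             # hypothetical 7 octets of subaddress data

# The length field counts the type octet plus the data octets.
length = 1 + len(payload)
tlv = struct.pack(">HHB", SOURCE_SUBADDRESS_TAG, length, SUBADDRESS_TYPE) + payload

# Parse the header back out to check the layout.
tag, parsed_length, subaddress_type = struct.unpack(">HHB", tlv[:5])
data = tlv[5:]

print(tlv.hex().upper())
```

The hex dump starts with 20200008A0, followed by the 7 data octets.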

That library, when you add source_subaddress to the client.send_message() function (see the readme.md for the demo code), creates a PDU with a length of 0007 and no A0 type. When you try to add it, well, I just struggled to do so.

I ended up with this in command.py, which is a horrible, hard coded kludge :

                # I need to add a type of "A0" to the message for this TLV.
                # This is a bodge and I need a way to do it more generically.
                if field == "source_subaddress":
                    fvalue = field_value + chr(0)
                    field_length = len(fvalue) + 1
                    value = struct.pack(">HHB", field_code, field_length, 160) + fvalue.encode()
                else:
                    # For everything else, do this:
                    fvalue = field_value + chr(0)
                    field_length = len(fvalue)
                    value = struct.pack(">HH", field_code, field_length) + fvalue.encode()

That’s definitely a job for byte strings rather than strings - they never need to be Unicode in the first place! Try:

fvalue = field_value.encode("utf-8") + b"\x00"
value = struct.pack(">HH", field_code, len(fvalue)) + fvalue

Not sure if that makes anything any different. I still have to have a separate pack to add just chr(160) to the front. Any attempt to add it elsewhere adds \xc2.
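The difference comes down to whether the 160 is added before or after encoding. A small sketch, with a made-up field_value:

```python
field_value = "1234567"   # hypothetical subaddress text

# Added to the str first: chr(160) is a character, so UTF-8 turns it into two bytes.
wrong = (chr(160) + field_value).encode()

# Added after encoding: b"\xa0" is already a single raw byte.
right = b"\xa0" + field_value.encode()

print(wrong, right)
```

The first version is one byte longer and starts with \xc2\xa0; the second starts with the bare \xa0.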

I’m sure I’m missing something obvious. I’ll take this up again in the morning.

Cheers.

edit - ah, but this worked ok :

fvalue = b'\xa0' + bytes(field_value, "utf-8") + b'\x00'

Now I just need a configurable way to decide when to add it…
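One possible way to make it configurable is a lookup table of per-field prefix bytes. This is only a sketch: TLV_VALUE_PREFIXES and build_tlv are hypothetical names, not part of the library, and the NUL terminator is there because the earlier kludge appended chr(0).

```python
import struct

# Hypothetical table: bytes to prepend to the value of specific TLV fields.
TLV_VALUE_PREFIXES = {
    "source_subaddress": b"\xa0",
}


def build_tlv(field: str, field_code: int, field_value: str) -> bytes:
    """Pack one optional TLV as tag, length, then the (possibly prefixed) value."""
    prefix = TLV_VALUE_PREFIXES.get(field, b"")   # empty for ordinary fields
    body = prefix + field_value.encode("utf-8") + b"\x00"
    # The length is measured on the encoded bytes, so it matches what is sent.
    return struct.pack(">HH", field_code, len(body)) + body


print(build_tlv("source_subaddress", 0x2020, "123456").hex().upper())
```

Fields without an entry in the table get no prefix, so the if/else on the field name goes away.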