Inconsistent character decoding errors in Python

Question

In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:

>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>

This makes sense, since there really is no character defined for that byte. However, the second encoding returns the character at Unicode code point \u0088, even though that byte shouldn't be defined either:

>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True

Why is that?
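
To make the difference easier to reproduce, here is a minimal sketch that tries every single byte against both codecs and then peeks at the tables the traceback above points to. The helper name `undecodable_bytes` is my own and not part of the standard library, and the module-level `decoding_table` attribute is a CPython implementation detail (it is the name that appears in the traceback), so treat this as an illustration rather than a documented API.

```python
import encodings.cp1250
import encodings.iso8859_2  # the module behind the "latin2" alias in CPython


def undecodable_bytes(encoding):
    """Return the byte values that raise UnicodeDecodeError in `encoding`.

    Hypothetical helper for illustration; only meaningful for
    single-byte codecs, since each byte is decoded on its own.
    """
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


for name in ("cp1250", "latin2"):
    rejected = undecodable_bytes(name)
    print(name, [hex(value) for value in rejected])

# Implementation detail: look at the table entry for byte 0x88 in each codec.
# repr() is used so that non-printable characters show up as escapes.
print(repr(encodings.cp1250.decoding_table[0x88]))
print(repr(encodings.iso8859_2.decoding_table[0x88]))
```

If I read the results correctly, the cp1250 table holds a placeholder for 0x88 that the charmap decoder treats as undefined, while the latin2 table simply maps 0x88 to the code point with the same number, which would explain why only the first call raises.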

I've often seen it said that ISO-8859-1 (latin-1) permits decoding any byte; I guess the same applies to the other ISO encodings. On the other hand, it seems that charmap encodings do not have this property. However, I can't find any authoritative confirmation of this. — snakecharmerb, Dec 05 '19 at 14:42
Interesting, thanks for the comment! I imagined the explanation must be something like this, and it's comforting to know it rings a bell for some people. Hopefully someone can dig up an authoritative source :) — dlukes, Dec 05 '19 at 20:56
Interestingly, the docs state that "Each charmap encoding can decode any random byte sequence.", which is clearly not the case. https://docs.python.org/3/library/codecs.html — dlukes, Dec 05 '19 at 21:09
See also Giacomo's comment on [this question](https://stackoverflow.com/questions/58501530/python3-different-behaviour-between-latin-1-and-cp1252-when-decoding-unmapped-ch) — snakecharmerb, Aug 15 '20 at 10:27

