
In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:

>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>

That makes sense, since there really is no character defined for that byte. However, the second encoding returns the character at Unicode code point U+0088, even though that byte shouldn't be defined there either:

>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True
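To check that this isn't specific to \x88, here is a minimal sketch that tries every single-byte value against both codecs and collects the ones that raise. The helper name `undecodable_bytes` is my own and purely illustrative; it relies only on `bytes.decode` and `UnicodeDecodeError`.

```python
def undecodable_bytes(encoding):
    """Return the byte values (0-255) that the given codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Compare the two codecs from the question.
for name in ("cp1250", "latin2"):
    print(name, [hex(b) for b in undecodable_bytes(name)])
```

On my understanding of the two codecs, cp1250 should report a handful of bytes including 0x88, while latin2 should report none at all.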

Why is that?

dlukes
  • I've often seen it said that ISO-8859-1 (latin-1) permits decoding any byte; I guess the same applies to other ISO encodings. On the other hand, it seems that charmap encodings do not have this property. However, I can't find any authoritative confirmation of this. – snakecharmerb Dec 05 '19 at 14:42
  • Interesting, thanks for the comment! I imagined the explanation must be something like this, and it's comforting to know it rings a bell for some people. Hopefully someone can dig up an authoritative source :) – dlukes Dec 05 '19 at 20:56
  • Interestingly, the docs state that "Each charmap encoding can decode any random byte sequence.", which is clearly not the case. https://docs.python.org/3/library/codecs.html – dlukes Dec 05 '19 at 21:09
  • See also Giacomo's comment on [this question](https://stackoverflow.com/questions/58501530/python3-different-behaviour-between-latin-1-and-cp1252-when-decoding-unmapped-ch) – snakecharmerb Aug 15 '20 at 10:27
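One way to probe the charmap point raised in the comments is to look at the decoding tables that CPython ships for the two codecs. The sketch below assumes the `encodings.cp1250` and `encodings.iso8859_2` modules and their `decoding_table` attribute, which are CPython implementation details rather than documented API, and it assumes that U+FFFE is the placeholder those tables use for an undefined byte.

```python
import codecs
import encodings.cp1250
import encodings.iso8859_2

# Assumption: the generated charmap tables mark an undefined byte with U+FFFE.
UNDEFINED = "\ufffe"

# "latin2" is an alias; this shows which codec it actually resolves to.
print(codecs.lookup("latin2").name)

for module in (encodings.cp1250, encodings.iso8859_2):
    # decoding_table is a 256-character string indexed by byte value.
    entry = module.decoding_table[0x88]
    status = "undefined" if entry == UNDEFINED else "mapped to U+%04X" % ord(entry)
    print(module.__name__, status)
```

If these assumptions hold, both codecs are table-driven, and the difference lies in the tables themselves: the cp1250 table leaves 0x88 undefined, while the ISO 8859-2 table maps it to the C1 control character with the same number.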
