
I'm trying to read a text file in Python 3 with encoding cp1252, but the file contains bytes that cp1252 does not map (for instance, byte 0x8d).

with open(inputfilename, mode='r', encoding='cp1252') as inputfile:
    print(inputfile.readlines())

As expected, I get the following exception:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(inputfile.readlines())
  File "/usr/lib/python3.6/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 14: character maps to <undefined>

I'd like to understand why, when I read the same file with encoding latin-1, I don't get the same exception and byte 0x8d is shown as the escape sequence \x8d:

$ python3 test.py
['This is a test\x8d file\n']

As far as I know, byte 0x8d has no mapping in either encoding (latin-1 or cp1252). What am I missing? Why does Python 3 behave differently for the two?

snakecharmerb
Andrea Baldini
  • latin-1 is special in that it will decode any bytestring, returning the original byte if it has no latin-1 equivalent. Other encodings (like cp1252) will raise UnicodeDecodeError if the byte cannot be mapped. – snakecharmerb Oct 22 '19 at 10:00
  • Thanks for your reply. Why did the Python developers choose to handle latin-1 this way? Is there any official reference for this behaviour? Are there other "special" encodings? – Andrea Baldini Oct 22 '19 at 10:13
  • I _believe_ that this is a universal property of latin-1, rather than a Python-specific behaviour, but I can't find any authoritative confirmation of this, which is why I'm commenting instead of answering. AFAIK no other text-encoding defines this behaviour. – snakecharmerb Oct 22 '19 at 10:20
  • Latin-1 reserved 8x and 9x, and those values were used for the C1 control codes (https://en.wikipedia.org/wiki/C0_and_C1_control_codes). As you note, the reason is not arbitrary: Latin-1 should not use those values, so if someone does use them in Latin-1 text they should be C1 codes, and there is not much ambiguity (although it goes against the Python mantra "explicit is better than implicit"). – Giacomo Catenazzi Oct 22 '19 at 16:57

1 Answer


From the docs: "The simplest text encoding (called 'latin-1' or 'iso-8859-1') maps the code points 0–255 to the bytes 0x0–0xff".

https://docs.python.org/3/library/codecs.html
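To make the quoted sentence concrete, here is a small sketch (not taken from the docs) that decodes every possible byte value with both codecs. It relies only on `bytes.decode` from the standard library. The sample line is the one from the question, and the variable names are made up for illustration. Under latin-1, byte 0x8d simply becomes code point U+008D, a C1 control character, so nothing is ever "unmapped". Python's cp1252 table, on the other hand, leaves a handful of byte values undefined.

```python
# latin-1: every byte value 0x00-0xff decodes to the code point with the same number.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
latin1_is_identity = [ord(ch) for ch in latin1_text] == list(range(256))
print(latin1_is_identity)  # True

# cp1252: collect the byte values that Python's codec refuses to decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(hex(value))
print(undefined_in_cp1252)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# The line from the question, decoded both ways.
sample = b"This is a test\x8d file\n"
as_latin1 = sample.decode("latin-1")                    # keeps U+008D
as_cp1252 = sample.decode("cp1252", errors="replace")   # substitutes U+FFFD
print(repr(as_latin1))
print(repr(as_cp1252))
```

If the file really is cp1252 with a few stray bytes, passing `errors='replace'` (or another error handler) to `open()` is one way to avoid the exception. Whether replacing the bytes is acceptable depends on what the data is used for.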

o17t H1H' S'k