I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:

import sys

print(sys.getdefaultencoding())

f = open("input.txt", "r")
f.readline()

This script fails while reading the first line, even if the right quotation mark is not on the first line. The exception looks like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>

The input file is UTF-8 encoded; I've tried both with and without a BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.

This script fails on the machine with Python 3.6.5 but works well on another with Python 3.6.0. Both machines are Windows.

My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads a file that I don't wish to change. What could differ between these machines other than the Python patch version? Why does vanilla open use cp1252 if the system default is utf-8?

Dmitry Kuzminov
  • `open` has an argument `errors=None`, where `errors` can be `ignore`, `replace`, etc.: https://docs.python.org/3/library/functions.html#open Also, `locale.getpreferredencoding(False)` is used to get the encoding – Epsi95 Aug 23 '21 at 04:07
  • Can you try opening in binary mode "rb" to read/write binary data as is, without any transformations such as converting newlines to/from platform-specific values or decoding/encoding text using a character encoding? – SidJ Aug 23 '21 at 04:13
  • @Epsi95, the third-party software that fails doesn't use these arguments. I didn't ask how to make the minimal example work, but why it fails on one machine, works on another, and uses cp1252. – Dmitry Kuzminov Aug 23 '21 at 04:15
  • Can you print `locale.getpreferredencoding(False)` in the two versions that you are running? Seems like an encoding issue – Epsi95 Aug 23 '21 at 04:22
  • @Epsi95, cp1251 on the working machine and cp1252 on the non-working one. That answers part of the question. – Dmitry Kuzminov Aug 23 '21 at 04:26
  • `Why does vanilla open use cp1252 if the system default is utf-8` ... `open` uses `locale.getpreferredencoding(False)` to get the encoding, so one should explicitly pass `utf-8` to `open` – Epsi95 Aug 23 '21 at 04:27
  • For consistency, when I know what I am reading, I like to be explicit, `open("input.txt", "rt", encoding="utf-8")` – Amadan Aug 23 '21 at 04:40

1 Answer

As clearly stated in Python's open documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
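
The two encodings are easy to confuse, so here is a minimal sketch that prints both. The helper name `report_encodings` is made up for illustration. It assumes the behaviour described in the quote above, as on the Python 3.6 machines in the question.

```python
import locale
import sys


def report_encodings():
    """Return the two encoding settings side by side (hypothetical helper)."""
    return {
        # Default for str.encode() / bytes.decode(); this is utf-8 on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Per the documentation quoted above, this is what open() falls back to
        # in text mode when no encoding= argument is given.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


encodings = report_encodings()
for name, value in encodings.items():
    print(f"{name}: {value}")
```

If the second value differs between two machines, an `open()` call without `encoding=` will decode the same file differently on each.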

Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
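
According to the comments, the working machine reports cp1251 and the failing one cp1252. The sketch below suggests why that alone could explain the difference: it decodes the three UTF-8 bytes of RIGHT DOUBLE QUOTATION MARK with each code page. The helper `try_decode` is a hypothetical name used only for this illustration.

```python
# UTF-8 bytes of U+201D RIGHT DOUBLE QUOTATION MARK: 0xE2 0x80 0x9D.
data = "\u201d".encode("utf-8")


def try_decode(raw, encoding):
    """Decode raw bytes, returning the text or a short error description."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


for enc in ("utf-8", "cp1251", "cp1252"):
    # ascii() keeps the output printable on any console encoding.
    print(enc, "->", ascii(try_decode(data, enc)))
```

Python's cp1252 codec has no mapping for byte 0x9D, so decoding raises the error from the traceback. cp1251 does map all three bytes, so nothing is raised there, but the result is three unrelated Cyrillic characters rather than the quotation mark. In other words, the "working" machine is probably reading the text incorrectly without reporting it.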

Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
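
As a minimal sketch of that advice, the example below writes a small UTF-8 file to a temporary path (the file contents are invented for the demonstration) and reads it back with an explicit encoding, so the result no longer depends on the machine's locale.

```python
import os
import tempfile

# Create a throwaway UTF-8 file containing U+201D on its second line.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("first line\nsecond \u201dline\u201d\n".encode("utf-8"))

# Passing encoding= explicitly bypasses the locale-dependent default.
with open(path, "r", encoding="utf-8") as f:
    first = f.readline()
    second = f.readline()

os.remove(path)

print(ascii(first))
print(ascii(second))
```

For files that may start with a BOM, `encoding="utf-8-sig"` is an option worth considering, since it strips the BOM when present and otherwise behaves like utf-8.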

Mark Tolonen