
Steps to reproduce:

  1. Create a file test.txt with content This is 中文 (i.e., UTF-8 encoded non-ASCII text).
  2. Custom-compile Python 3.5.2 on the Intel Edison.
  3. Launch the custom-compiled python3 interpreter and run the following code:

    with open('test.txt', 'r') as fh:
        fh.readlines()
    

Actual behavior:

A UnicodeDecodeError exception is raised. By default the file is opened with the 'ASCII' codec instead of 'UTF-8':

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
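
For readers without an Edison at hand, the failure can probably be reproduced on any machine by forcing the ASCII codec by hand. This is only a sketch: the temporary path and the explicit `encoding='ascii'` are stand-ins I am assuming for what the locale-less interpreter picks on its own.

```python
import os
import tempfile

# Hypothetical scratch location; the steps above use test.txt in the working directory.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Write the sample text as raw UTF-8 bytes (the escapes are the two CJK characters).
with open(path, 'wb') as fh:
    fh.write('This is \u4e2d\u6587'.encode('utf-8'))

# Forcing 'ascii' stands in for the default chosen on the locale-less system.
error = None
try:
    with open(path, 'r', encoding='ascii') as fh:
        fh.readlines()
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The first non-ASCII byte sits right after the eight ASCII characters of "This is ", which is why the message points at position 8.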

On a "regular" Linux system this problem is easily solved by setting a proper locale, see e.g. this post or that post. On the Intel Edison, however, I cannot set LC_CTYPE because the default Yocto Linux distribution ships without locales (see e.g. this page).

I also tried to use a couple of other hacks like

    import sys; sys.getfilesystemencoding = lambda: 'UTF-8'
    import locale; locale.getpreferredencoding = lambda: 'utf-8'

And I tried setting the PYTHONIOENCODING=utf8 environment variable before starting the python interpreter.

However, none of this works. The only workaround is to pass the encoding explicitly as a keyword argument to the open() call. This works for the above snippet, but it does not change the default for the packages I am using (which implicitly open files as ASCII and may or may not offer a way to override that default).
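
For reference, the explicit workaround looks roughly like this. It is a minimal sketch; the temporary path is an assumption standing in for test.txt.

```python
import os
import tempfile

# Hypothetical scratch location standing in for test.txt.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Naming the codec on both sides makes the result independent of the locale.
with open(path, 'w', encoding='utf-8') as fh:
    fh.write('This is \u4e2d\u6587')

with open(path, 'r', encoding='utf-8') as fh:
    lines = fh.readlines()

print(lines)
```

This only helps for open() calls I control, which is exactly the limitation described above.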

What is the proper way to set the default encoding that the Python interpreter uses when opening files? (Of course without installing unneeded system-wide locales.)

user8472
  • Why not just use `open('test.txt', 'r', encoding='utf8')`? Explicit is better than implicit. Don't use hacks. – Martijn Pieters May 26 '17 at 13:46
  • `sys.getfilesystemencoding()` is not used to determine the encoding of newly opened files. Replacing `locale.getpreferredencoding` won't work, because the *C code that opens files* won't call the Python version, it has direct access to the original C function. – Martijn Pieters May 26 '17 at 13:47
  • `PYTHONIOENCODING` applies to `stdin`, `stdout` and `stderr`, not to `open()`. – Martijn Pieters May 26 '17 at 13:47
  • Why not just use `LC_ALL='.UTF-8'` when running Python? You don't need to set the whole locale, just that environment variable should suffice. – Martijn Pieters May 26 '17 at 13:48
  • @MartijnPieters I pointed out that adding the `encoding='utf8'` option works for the code snippet. But it won't work for all packages that I import and which open files internally. I need to have the behavior of having `LC_CTYPE` set to `.UTF-8` without being able to set this explicitly (as the Yocto Linux distribution does not install locales by default). – user8472 May 26 '17 at 13:54
  • It's just an environment variable, set it only for those Python processes. – Martijn Pieters May 26 '17 at 13:57

1 Answer


You can set the LC_ALL environment variable to alter the default:

    $ python3 -c 'import locale; print(locale.getpreferredencoding())'
    UTF-8
    $ LC_ALL='.ASCII' python3 -c 'import locale; print(locale.getpreferredencoding())'
    US-ASCII

I tested this both on OS X and CentOS 7.
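
If you want to run the same check from a script instead of the shell, something along these lines should work. It is a sketch only: `preferred_encoding_with` is a helper name I made up, and what it returns depends entirely on the locales available on the machine and on the Python version.

```python
import os
import subprocess
import sys


def preferred_encoding_with(lc_all):
    """Start a child interpreter with LC_ALL overridden and report its preferred encoding."""
    env = dict(os.environ, LC_ALL=lc_all)
    out = subprocess.check_output(
        [sys.executable, '-c',
         'import locale; print(locale.getpreferredencoding())'],
        env=env,
    )
    return out.decode('ascii').strip()


# 'C' is just an example value; try the locale names your system actually has.
result = preferred_encoding_with('C')
print(result)
```

The variable is only set for the child process, so nothing system-wide is touched.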

As for your other attempts, here is why they don't work:

  • sys.getfilesystemencoding() applies to filenames only (e.g. os.listdir() and friends).
  • The io module doesn't actually use the locale.getpreferredencoding() function, so altering the function on the module won't have an effect. A lightweight _bootlocale.py bootstrap module is used instead. More on that below.
  • PYTHONIOENCODING only applies to sys.stdin, sys.stdout and sys.stderr.
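
To see that these really are separate settings, you can print them side by side. This is a small diagnostic sketch; `encoding_report` and the dictionary keys are names I picked for illustration.

```python
import locale
import sys
import tempfile


def encoding_report():
    """Collect the encodings Python uses in different places."""
    # A text-mode temporary file shows what open() picks when no encoding is given.
    with tempfile.TemporaryFile('w+') as fh:
        open_default = fh.encoding
    return {
        'filenames': sys.getfilesystemencoding(),
        'locale': locale.getpreferredencoding(False),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'open() default': open_default,
    }


report = encoding_report()
for name, value in sorted(report.items()):
    print(name, '->', value)
```

On a system like the one in the question I would expect the 'open() default' entry to show the ASCII alias even when the other entries have been overridden.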

If setting environment variables ultimately fails, you can still patch the _bootlocale module:

    import _bootlocale

    old = _bootlocale.getpreferredencoding  # so you can restore
    _bootlocale.getpreferredencoding = lambda *args: 'UTF-8'

This works for me (again on OS X and CentOS 7, tested with 3.6):

    >>> import _bootlocale
    >>> open('/tmp/test.txt', 'w').encoding  # showing pre-patch setting
    'UTF-8'
    >>> old = _bootlocale.getpreferredencoding
    >>> _bootlocale.getpreferredencoding = lambda *a: 'ASCII'
    >>> open('/tmp/test.txt', 'w').encoding  # gimped hook
    'ASCII'
Martijn Pieters
  • My question is specifically about the Intel Edison. That is an embedded system with a customizable, minimal Linux system that lacks many features which macOS X and Desktop Linux distributions have. In particular, there are no locales installed so setting `LC_*` just leads to an error message: `-sh: warning: setlocale: LC_ALL: cannot change locale (.UTF-8): No such file or directory` – user8472 May 26 '17 at 13:57
  • @user8472: but that's just a warning. Does *Python* work? – Martijn Pieters May 26 '17 at 14:05
  • The output from the python interpreter is ALWAYS `ANSI_X3.4-1968`, regardless of the content of the `LC_ALL` environment variable. Likewise, the `open` command in the interpreter ALWAYS fails UNLESS I explicitly set the encoding to `'utf-8'`. The `open` commands in packages I include also fail reproducibly, regardless of the value of the `LC_ALL` environment variable. – user8472 May 26 '17 at 14:15
  • @user8472: well, that sucks. Only goes to show that you should never, ever ever ever, rely on the locale when opening files and you **must** use `encoding='...'` whenever you want to read textual data that is not ASCII. However, I figured out what hook the `io` module *really* uses, see the update. – Martijn Pieters May 26 '17 at 15:29
  • Yes, that works! Both for my code (for which I agree I **should** specify the encoding explicitly) and for code I import (on which I have no control and where I may not be able to specify encodings explicitly). – user8472 May 27 '17 at 06:08