
When running the code:

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

import xml.etree.ElementTree as ET
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text 

I get the expected output vägen. However, if I pipe the output to wc -l I get a UnicodeEncodeError (ETerr.py holds the code snippet given above):

:~> ETerr.py | wc -l
Traceback (most recent call last):
  File "./ETerr.py", line 5, in <module>
    print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
0
:~> 

How can the code behave differently depending on whether its output is piped or not, and how can I fix it so that it doesn't?

Please note that the code snippet above is merely set up to demonstrate the issue with as little code as possible. In the actual script where I need to resolve the issue, the XML is retrieved using urllib, hence I have little control over its format.
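
For reference, the retrieval in the real script is roughly of this shape (the URL and the tag name here are placeholders, not the actual service):

import urllib
import xml.etree.ElementTree as ET

# The XML arrives as bytes from the server, so its encoding is whatever the
# server chose to send; I can only parse what I get.
data = urllib.urlopen('http://example.com/roads.xml').read()
print ET.fromstring(data).find('road').text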

cpaitor
  • Try breaking this into pieces: one that stores the xml in a string, one that does the ET parse, one that does the find, and finally one that does the print. It’s _probably_ the last one that fails, but knowing for sure is useful. – abarnert Apr 09 '18 at 14:56
  • Anyway: when you print a Unicode string in Python 2, it uses your default encoding. It’s probably guessing a default of UTF-8 or Latin1 when stdout is a terminal, but ASCII when it’s a pipe. Try printing out `sys.getdefaultencoding()` in both cases. If this is the problem, show us what your `LOCALE` and `LC_`-prefixed environment variables are. – abarnert Apr 09 '18 at 15:00
  • Add `.encode('utf-8')` to the end of the print statement. – eagle Apr 09 '18 at 15:02
  • If you know your terminal and tools all expect a particular encoding and just want to force your code to that encoding no matter what, not caring about portability to other systems, you can `print u.encode('utf-8')`. But it would be better to diagnose and solve the actual config problems. – abarnert Apr 09 '18 at 15:04
  • Also, is there a reason you have to use Python 2 for this? The stdio encoding is one of the things that’s been improved in Python 3 (and improved a few more times up through Python 3.7), so this most likely won’t even come up in the first place. – abarnert Apr 09 '18 at 15:06
  • It's a pure Python problem, as shown in the stack trace; a command in a pipe does not know anything about the next command and does not care about it. – LMC Apr 09 '18 at 15:13
  • @abarnert, yes it is the `print` statement that fails. `sys.getdefaultencoding()` yields `ascii` no matter if the output is piped or not – cpaitor Apr 09 '18 at 15:17
  • @eagle, sure this solves the problem but does not really explain why the behaviour differs when output is piped or not – cpaitor Apr 09 '18 at 15:19
  • One more thing to rule out: I'm guessing you're on a POSIX system (Linux or Mac or other *BSD), not Windows; am I right? (Even with Cygwin Python, things get more complicated on Windows, so I want to make sure we can ignore that here.) – abarnert Apr 09 '18 at 17:08

1 Answer


First, let me point out that this is not a problem in Python 3, and fixing it is in fact one of the reasons that it was worth a compatibility-breaking change to the language in the first place. But I'll assume you have a good reason for using Python 2, and can't just upgrade.


The proximate cause here (assuming you're using Python 2.7 on a POSIX platform—things can be more complicated on older 2.x, and on Windows) is the value of sys.stdout.encoding. When you start up the interpreter, it does the equivalent of this pseudocode:

if isatty(stdoutfd):
    sys.stdout.encoding = parse_locale(os.environ['LC_CTYPE'])
else:
    sys.stdout.encoding = None

And every time you write to a file, including sys.stdout, including implicitly from a print statement, it does something like this:

if isinstance(s, unicode):
    if self.encoding:
        s = s.encode(self.encoding)
    else:
        s = s.encode(sys.getdefaultencoding())

The actual code does standard POSIX stuff looking for fallbacks like LANG, and hardcodes a fallback to UTF-8 in some cases for Mac OS X, etc., but this is close enough.
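
You can reproduce that implicit encode step by hand to see exactly which call blows up (a quick sketch; nothing ElementTree-specific about it):

# Python 2: the encode that print performs implicitly on a unicode string.
u = u'v\xe4gen'
print u.encode('utf-8')   # what happens when stdout's encoding is UTF-8
try:
    u.encode('ascii')     # what happens when encoding is None -> default 'ascii'
except UnicodeEncodeError as e:
    print e               # the same error you see in the piped run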


This is only sparsely documented, under file.encoding:

The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.


To verify that this is your problem, try the following:

$ python -c 'print __import__("sys").stdout.encoding'
UTF-8
$ python -c 'print __import__("sys").stdout.encoding' | cat
None

To be extra sure this is the problem:

$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
Latin-1
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
Latin-1

So, how do you fix this?

Well, the obvious way is to upgrade to Python 3.6, where you'll get UTF-8 in both cases, but I'll assume there's a reason you're using Python 2.7 and can't easily change it.

The right solution is actually pretty complicated. But if you want a quick&dirty solution that works for your system, and for most current Linux and Mac systems with standard Python 2.7 setups (even though it may be disastrously wrong for older Linux systems, older Python 2.x versions, and weird setups), you can either:

  • Set the environment variable PYTHONIOENCODING to override the detection and force UTF-8. Setting this in your profile or similar may be worth doing if you know that every terminal and every tool you're ever going to use from this account is UTF-8, although it's a terrible idea if that isn't true.
  • Check sys.stdout.encoding and wrap sys.stdout with a 'UTF-8' writer if it's None (see the sketch after this list).
  • Explicitly call .encode('UTF-8') on everything you print.
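
Here's a minimal sketch of the second and third options, assuming UTF-8 really is what the consumers of your output expect:

import sys
import codecs

# Option 2: if stdout has no encoding (it's a pipe or a file), wrap it in a
# UTF-8 stream writer. UTF-8 here is an assumption about your downstream tools.
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'v\xe4gen'  # now encodes to UTF-8 whether piped or not

# Option 3: encode each unicode string explicitly instead.
# print u'v\xe4gen'.encode('UTF-8')

The PYTHONIOENCODING route needs no code changes at all, which makes it attractive if you control the environment the script runs in.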
abarnert
  • You are right, I should have added that this is not a problem in Python 3 and that the problem manifests on a POSIX platform. Interesting though, what is the point of setting the encoding of stdout depending on whether it is a tty-like device or not? – cpaitor Apr 09 '18 at 18:51
  • @cpaitor My guess is that the original rationale made sense for most 90s Unix systems (where the terminal locale might well be completely decoupled from what was expected to be in most text files—if it was even anything useful in the first place), and that all of the attempts to come up with something that better fit modern Linux had too many backward compatibility problems to implement before 3.0, but I don’t know for sure. (OS X made it easy by just mandating that the terminal, all userland tools, and all text files are UTF-8, period…) – abarnert Apr 09 '18 at 19:57
  • @cpaitor Also: the linux distros were gradually converting to "everything is UTF-8 by default" at about the same time they were starting their Python 3 migrations. So it probably made sense to put more energy into making sure Python 3 worked out-of-the-box on their distro than to set up custom configurations of Python 2 for their distro packages. (Especially since at least Canonical and Redhat had Python core devs on the payroll working on the Python 3 migration.) – abarnert Apr 09 '18 at 20:14