How to correctly parse utf-8 xml with ElementTree?

Question

I need help to understand why parsing my xml file* with xml.etree.ElementTree produces the following errors.

*My test XML file contains Arabic characters.

Task: Open and parse utf8_file.xml file.

My first try:

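import codecs
import xml.etree.ElementTree as etree

# codecs.open() decodes the file, so the parser receives unicode text instead of bytes
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)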

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second try:

import codecs
import xml.etree.ElementTree as etree

with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    # passes the open file object straight to tostring()
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on the possible solution.

score 25 · Accepted Answer · answered Feb 11 '14 at 09:41


Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree

# Open in binary mode so the parser receives raw bytes (also correct on Python 3)
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

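An XML file must carry enough information in its first line (the XML declaration, for example <?xml version="1.0" encoding="utf-8"?>) for the parser to determine the encoding. If the declaration is missing and there is no byte order mark, the parser must assume UTF-8.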

Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
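As a minimal sketch of that point, the snippet below writes the same Arabic text to a temporary file in two encodings and parses each file in binary mode. The sample text, the helper name parse_bytes_from_file, and the use of a temporary file are assumptions made for illustration; they are not part of the question.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample text: the Arabic word "marhaba", written with escapes
text = u'\u0645\u0631\u062d\u0628\u0627'
xml_source = u'<?xml version="1.0" encoding="{0}"?><root>{1}</root>'


def parse_bytes_from_file(encoding):
    """Write the sample encoded as `encoding`, then let the parser decode it."""
    data = xml_source.format(encoding, text).encode(encoding)
    fd, path = tempfile.mkstemp(suffix='.xml')
    try:
        with os.fdopen(fd, 'wb') as out:
            out.write(data)
        # Binary mode: the parser gets raw bytes and reads the declaration itself
        with open(path, 'rb') as xml_file:
            return etree.parse(xml_file).getroot().text
    finally:
        os.remove(path)


utf8_text = parse_bytes_from_file('utf-8')
utf16_text = parse_bytes_from_file('utf-16')

# With no declaration at all, the bytes are treated as UTF-8
no_declaration_text = etree.fromstring(
    u'<root>{0}</root>'.format(text).encode('utf-8')).text
```

In all three cases the caller never names an encoding when reading; the parser works it out from the bytes.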

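Your first attempt failed because codecs.open() hands the parser already-decoded Unicode text, while the parser expects byte strings. Python therefore tried to encode the Unicode values again, implicitly using the ASCII codec, which cannot represent the Arabic characters. The second attempt failed because etree.tostring() expects a parsed element (an Element) as its first argument, not a file object.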
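To illustrate the direction of each call, here is a small sketch, assuming a hypothetical in-memory document rather than the asker's file: fromstring() turns bytes into an Element, and tostring() turns an Element back into bytes. The exact exception raised for a non-Element argument varies between Python versions, so the sketch catches both likely types.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample: UTF-8 bytes, as read from a file opened in binary mode
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<root><item>\u0645\u0631\u062d\u0628\u0627</item></root>'
).encode('utf-8')

# fromstring(): bytes in, Element out
root = etree.fromstring(xml_bytes)

# tostring(): Element in, serialized bytes out
serialized = etree.tostring(root, encoding='utf-8', method='xml')

# The serialized bytes can be parsed again
reparsed = etree.fromstring(serialized)

# Anything that is not an Element (a file object, a plain object) is rejected
error_name = None
try:
    etree.tostring(object(), encoding='utf-8')
except (AttributeError, TypeError) as exc:
    error_name = type(exc).__name__
```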

answered Feb 11 '14 at 09:41

Martijn Pieters


Excellent, it appeared to be easier than I thought. Even "utf-8 without BOM" files get parsed correctly. – minerals Feb 11 '14 at 09:48
UTF-8 without BOM is the standard; *with* BOM is mostly Microsoft wanting to make it easier to autodetect 8-bit encodings other than UTF-8. – Martijn Pieters Feb 11 '14 at 09:53

`etree.parse(a_file)` handles Unicode by default. However `etree.fromstring(a_string)` doesn't until Python 3.x (see http://bugs.python.org/issue11033) so to parse a string, you have to encode it manually, like `etree.fromstring(a_string.encode('utf-8'))`. – Chris Johnson Aug 15 '16 at 12:03
@ChrisJohnson: This question is about Python 2, where file objects produce byte strings, not Unicode. The question concerns the user reading data from a file and manually decoding, which is entirely pointless. – Martijn Pieters Aug 15 '16 at 12:05
@MartijnPieters I agree. This comment is meant to point out a non-obvious behavior for anyone looking into the string-based approach. It's non-obvious that the file-based method handles encoding by default but the string-based method requires pre-encoding. – Chris Johnson Aug 15 '16 at 12:12
You can make it simpler and skip opening it as a file, I have code that does `root = et.parse(sys.stdin).getroot()` and it works just fine. Tested in Py3.6 – Marcin Oct 18 '17 at 16:09
@Marcin: but that requires piping in the XML file. That's a different use case. – Martijn Pieters Oct 18 '17 at 16:11
Also works with `sys.argv[1]`, I just used stdin as an example. – Marcin Oct 18 '17 at 16:16
@Marcin: right, that's what you mean. Yes, you can pass in an open file object or a filename. – Martijn Pieters Oct 18 '17 at 16:20
@MartijnPieters I can see `cElementTree.iterparse()` also tries to decode, which in my case generates `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 293: ordinal not in range(128)`. I am simply passing the file object. Can I help it to decode somehow? – Tom Hemmes Dec 10 '18 at 16:15
@TomHemmes: no, not without a traceback and example input, sorry. – Martijn Pieters Dec 10 '18 at 17:00
