
I have to parse XML files that start like this:

xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <annotationStandOffs xmlns="http://www.tei-c.org/ns/1.0">
        <standOff> 
    ...
</standOff> 
</annotationStandOffs>
    '''

The following code only works if I remove the first line (the XML declaration) from the string shown above:

from lxml import etree

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True)

# Raises ValueError while xml_string is a str that still contains the XML declaration.
XML_tree = etree.XML(xml_string, parser=parser)

Otherwise I get the error:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

JFerro

1 Answer


As the error indicates, the encoding part of the XML declaration is meant to tell the parser how to convert bytes (e.g. as read from a file) into a string. It doesn't make sense when the XML already is a string.

Some XML parsers silently ignore this information when parsing from a string. Some throw an error.
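As an illustration of the "silently ignore" case, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of lxml. The sample document and the name xml_text are made up for this example. The declaration claims ISO-8859-1, but because the input is already a str, that claim has no effect on the result.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a str (not bytes) whose declaration claims ISO-8859-1.
xml_text = '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'

# The standard-library parser accepts the str and does not use the
# declared encoding to decode it, since there are no bytes to decode.
root = ET.fromstring(xml_text)
print(root.text)
```

lxml is the stricter kind of parser: it refuses the same kind of input with the ValueError quoted in the question.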

So since you're pasting XML into a string literal in Python source code, it would only make sense to remove the declaration yourself while you're editing the Python file.

The other, less sensible option would be to use a byte string literal, b'''...''', or to encode the string into a single-byte encoding at run-time, e.g. '''...'''.encode('windows-1252'). But this opens another can of worms: when the encoding of your Python file (e.g. UTF-8) clashes with the encoding declared in your copy-pasted XML (e.g. UTF-16), you'll get more interesting errors.

Long story short, don't do that. Don't copy-paste XML into Python source code without taking the XML declaration out. And don't try to "fix" it with run-time string encode() tomfoolery.

The opposite is also true. If you have bytes (e.g. read from a file in binary mode, or from a network socket), then give those bytes to the XML parser. Don't manually decode() them into a string first.
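A minimal sketch of the bytes case, again using the standard library's xml.etree.ElementTree rather than lxml. The helper name parse_xml_bytes and the sample data are assumptions made for this example. The bytes are built inline here to stand in for data read from a file opened in binary mode.

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(data: bytes) -> ET.Element:
    """Hypothetical helper: hand raw bytes to the parser untouched.

    The parser reads the encoding from the XML declaration itself,
    so no manual decode() step is needed.
    """
    return ET.fromstring(data)


# Stand-in for bytes read from a file in binary mode: the document is
# stored as ISO-8859-1, and its declaration says so.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'
).encode("iso-8859-1")

root = parse_xml_bytes(xml_bytes)
print(root.text)
```

Decoding xml_bytes by hand would mean guessing the encoding yourself, which is exactly the job the declaration is there to do for the parser.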

Tomalak