Central way to filter invalid unicode chars in lxml?

Question

It is common knowledge that certain character ranges aren't allowed in XML documents. I'm aware of solutions to filter those characters out (like [1], [2]).

Going with the Don't Repeat Yourself principle, I would prefer to implement one of these solutions in one central point – right now, I have to sanitize any potentially unsafe text before it is fed to lxml. Is there a way to achieve this, e.g. by subclassing an lxml filter class, catching some exceptions, or setting a configuration switch?

Edit: To hopefully clarify this question a bit, here is some sample code:

from lxml import etree

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += u'\u0002'

Executing this gives the result

<root>&#65535;&#55296;</root>

Traceback (most recent call last):
  File "[…]", line 9, in <module>
    root.text += u'\u0002'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

As you can see, an exception is thrown for the \x02 character, but lxml happily escapes the other two out-of-range characters. The real trouble is that

s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)

also throws an exception. I find this behavior a bit unnerving, especially because it means lxml produces invalid XML documents.

It turns out that this could be a Python 2 vs. 3 problem. With Python 3.4, the code above throws the exception

Traceback (most recent call last):
  File "[…]", line 5, in <module>
    root.text += u'\ud800'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

The only remaining problem is the \uffff character, which lxml still happily accepts.

Perhaps this should be fixed in lxml itself. Did you submit a bug to the lxml project? — oefe, Jan 02 '15 at 10:01
@oefe I didn't. But it seems that this is a problem of `libxml` (for which lxml is just a wrapper), since PHP's `DOMDocument` (another wrapper) also escapes out-of-range characters and has problems loading such documents afterwards, so maybe a bug report should be filed there instead. — Percival Ulysses, Jan 06 '15 at 18:56
As a temporary solution, you could use the `soupparser` that is provided by lxml (`from lxml.html.soupparser import fromstring`); it will eat "" with no problem. It is based on the parser of libxml2. — Urban48, Jan 18 '15 at 21:33

Lillian Seabreeze · Answer 1 · 2015-01-22T17:17:17.333

Just filter the string before you parse it in lxml: cleaning invalid characters from XML (gist by lawlesst).

I tried it with your code and it seems to work, except that you need to change the gist to import re and sys!

from lxml import etree
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += invalid_xml_remove(u'\x02')
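
If you would rather not depend on the gist, a filter along the same lines can be sketched in a few lines. This is a minimal sketch, assuming Python 3 and the character ranges that the XML 1.0 specification allows; it is not the gist's code. The names `strip_invalid_xml_chars` and `set_clean_text` are made up for illustration, and the second one is only one possible way to get the "central point" the question asks for.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This covers control characters, lone surrogates, and U+FFFE/U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with every character XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def set_clean_text(element, text):
    """Hypothetical central helper: sanitize, then assign to element.text.

    Works on any object with a writable .text attribute, so an element
    created with lxml's etree.Element should be usable here.
    """
    element.text = strip_invalid_xml_chars(text)
    return element
```

With this, every assignment goes through `set_clean_text(root, some_text)` instead of `root.text = some_text`, so the sanitizing lives in one place. It does not change what lxml itself accepts; any code path that bypasses the helper is still unfiltered.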
