It is common knowledge that certain character ranges aren't allowed in XML documents. I'm aware of solutions to filter those characters out (like [1], [2]).

Following the Don't Repeat Yourself principle, I would prefer to implement one of these solutions in one central place: right now, I have to sanitize any potentially unsafe text before it is fed to lxml. Is there a way to achieve this, e.g. by subclassing an lxml filter class, catching some exceptions, or setting a configuration switch?


Edit: To hopefully clarify this question a bit, here is some sample code:

from lxml import etree

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += u'\u0002'

Executing this gives the result

<root>&#65535;&#55296;</root>

Traceback (most recent call last):
  File "[…]", line 9, in <module>
    root.text += u'\u0002'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

As you can see, an exception is thrown for the \x02 byte, but lxml happily escapes the other two out-of-range characters. The real trouble is that

s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)

also throws an exception. I find this behavior a bit unnerving, especially because it means lxml produces invalid XML documents that it cannot read back itself.


It turns out that this could be a Python 2 vs. 3 problem. With Python 3.4, the code above throws the exception

Traceback (most recent call last):
  File "[…]", line 5, in <module>
    root.text += u'\ud800'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

The only remaining problem is the \uffff character, which lxml still happily accepts.
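For reference, all three characters in the sample fall outside the XML 1.0 `Char` production (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF). The following is only a minimal sketch of a checker for that production, assuming Python 3; the names `is_xml_char` and `find_invalid_xml_chars` are made up for illustration and are not part of lxml.

```python
def is_xml_char(ch):
    """Return True if the single character ch is allowed by XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the remaining C0 controls
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml_char(ch)]


# The three characters from the sample code are all reported:
problems = find_invalid_xml_chars("\uffff\ud800\x02ok")
```

Running this over text before assigning it to `root.text` would at least show which characters are going to cause trouble, including the `\uffff` that is otherwise accepted silently.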

Percival Ulysses
  • 2
    Perhaps this should be fixed in lxml itself. Did you submit a bug to the lxml project? – oefe Jan 02 '15 at 10:01
  • 1
    @oefe I didn't. But it seems that this is a problem of `libxml` (for which lxml is just a wrapper), since PHP's `DOMDocument` (another wrapper) also escapes out-of-range characters and has problems loading such documents afterwards, so maybe a bug report should rather be filed there. – Percival Ulysses Jan 06 '15 at 18:56
  • 1
    As a temporary solution you could use `soupparser`, which is provided by lxml (`from lxml.html.soupparser import fromstring`); it will eat "" with no problem. It is based on the parser of libxml2. – Urban48 Jan 18 '15 at 21:33

1 Answer

Just filter the string before you pass it to lxml: see "cleaning invalid characters from XML" (gist by lawlesst).

I tried it with your code and it seems to work, except that you need to add `import re` and `import sys` to the gist.

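from lxml import etree
# cleaner.py is a local copy of the gist, with "import re" and "import sys" added
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

# The control character is filtered out before it reaches lxml
root.text += invalid_xml_remove(u'\x02')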
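If you would rather not depend on the gist, the same idea can be sketched in a few self-contained lines. This is not the gist's code, only a rough equivalent that assumes Python 3.3 or later (where `re` understands `\u` and `\U` escapes and there are no narrow Unicode builds). The names `strip_invalid_xml_chars` and `set_text` are hypothetical; `set_text` is just one possible "central point" and works on any object with a `.text` attribute, which includes lxml elements.

```python
import re

# Matches every character outside the XML 1.0 Char production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with characters that XML 1.0 forbids replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def set_text(element, text):
    """Hypothetical helper: sanitize text, then assign it to element.text."""
    element.text = strip_invalid_xml_chars(text)
    return element


# All three problem characters from the question are dropped:
cleaned = strip_invalid_xml_chars("a\uffffb\ud800c\x02d")
```

With something like this, the example above becomes `set_text(root, u'\uffff\ud800\x02')`, so the filtering lives in one function instead of at every call site. It does not change what lxml itself accepts, so any code that assigns to `.text` directly still bypasses the filter.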