Why does lxml.html sometimes swallow/remove whitespace instead of preserving it?

Question

Given the following code, one might reasonably expect lxml to spit back out almost exactly the same string of HTML that was fed into it.

from lxml import html

HTML_TEST_STRING = r"""
<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub>

<sub>jkl</sub>

<em>mno</em>

<em>pqr</em>

</pre>
"""

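parser = html.HTMLParser( remove_blank_text=False )
doc = html.fromstring( HTML_TEST_STRING, parser=parser )
# Serialize the parsed tree back to a text string (not bytes) for printing
html_out_string = html.tostring( doc, encoding="unicode" )
print( html_out_string )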

Instead, even though everything is contained within a <pre> pre-formatted block and the remove_blank_text flag is set to False, lxml preserves the whitespace for only some of the content and mysteriously drops it for the rest. The unexpected output of the above code:

<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub><sub>jkl</sub><em>mno</em>

<em>pqr</em>

</pre>

Specifically, whenever lxml encounters a <sub> tag, it goes batty and loses the "tail" text content that follows that sub element (even when that "sub element" arguably isn't even an element—since it's wrapped in a pre element).

naki · Accepted Answer · 2016-03-18T09:07:08.633

The most likely catalyst for this curious behavior is that, like me, you're on Windows and using a Python version that lxml doesn't publish a binary package for.

In such a scenario, one portion of the lxml website points you to the official unofficial Windows binaries for libxml2 so that you [potentially via the pip install script] can build a new lxml binary that supports your Python version. The problem, however, is that the binaries that it links you to are at least 4 years old and contain the bug you're running into.

The easiest solution to this problem is to instead download and install Christoph Gohlke's unofficial binary archive (a so-called "wheel") of lxml that is actually built for your OS/Python variant. (Another section of the lxml website also recommends this, but if you're like me, you ignored that path, wanting to run as little unofficial binary code as reasonably possible.)

(e.g. pip3 install --upgrade lxml-3.5.0-cp35-none-win32.whl)

Gohlke's package is built using a more recent version of libxml2, which has apparently already fixed that bug, so if everything above worked properly, you can now stop wasting hours of your life barking up the wrong 'tree'. You're not using lxml wrong, and it's not that lxml doesn't support preserving whitespace in this scenario (as so many other SO entries might have you think); it's just that you were unwittingly using a version of libxml2 that has a bug that's since been fixed.
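If you're not sure which libxml2 your lxml install is actually running on, lxml exposes the version numbers as tuples on lxml.etree. A minimal sketch for checking them (libxml2_versions is just a hypothetical helper name; I don't know the exact libxml2 release that fixed the bug, so this only tells you what you have, not whether it's new enough):

```python
from lxml import etree


def libxml2_versions():
    """Collect the version tuples that lxml reports about itself and libxml2."""
    return {
        # libxml2 library actually loaded at runtime
        "libxml2_runtime": etree.LIBXML_VERSION,
        # libxml2 version lxml was compiled against
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
        # lxml's own version
        "lxml": etree.LXML_VERSION,
    }


for name, version in libxml2_versions().items():
    print(name, ".".join(str(part) for part in version))
```

Running this before and after swapping in the wheel should show whether the libxml2 version actually changed.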

With a recent build of libxml2 driving your lxml installation, the output of the sample code you posted will instead produce what you expected (consistently preserved whitespace):

<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub>

<sub>jkl</sub>

<em>mno</em>

<em>pqr</em>

</pre>
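Rather than eyeballing the serialized output, you can also look at each child's tail directly, since the tail is where the whitespace after an element lives. A rough sketch, assuming the markup parses to a single root element such as the <pre> above (child_tails is a hypothetical helper, not part of lxml):

```python
from lxml import html


def child_tails(markup):
    """Return (tag, tail) pairs for the direct children of the parsed root element."""
    parser = html.HTMLParser(remove_blank_text=False)
    root = html.fromstring(markup, parser=parser)
    return [(child.tag, child.tail) for child in root]


sample = "<pre>\n<em>abc</em>\n\n<sub>ghi</sub>\n\n<em>mno</em>\n\n</pre>"
for tag, tail in child_tails(sample):
    # repr() makes whitespace-only tails visible
    print(tag, repr(tail))
```

On an affected build I'd expect the tail after each <sub> to come back missing or empty instead of containing the newlines, matching the broken output in the question.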
