Given the following code, one might reasonably expect lxml to spit back out almost exactly the same string of HTML that was fed into it.
from lxml import html
HTML_TEST_STRING = r"""
<pre>
<em>abc</em>
<em>def</em>
<sub>ghi</sub>
<sub>jkl</sub>
<em>mno</em>
<em>pqr</em>
</pre>
"""
parser = html.HTMLParser( remove_blank_text=False )
doc = html.fromstring( HTML_TEST_STRING, parser=parser )
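# Serialization step assumed here; the snippet as posted never defines html_out_string
html_out_string = html.tostring( doc, encoding="unicode" )
print( html_out_string )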
Instead, even though everything is contained within a <pre> pre-formatted block and the remove_blank_text flag is set to False, lxml preserves whitespace for only some of the contents and mysteriously drops it for the rest. See the unexpected output of the above code below:
<pre>
<em>abc</em>
<em>def</em>
<sub>ghi</sub><sub>jkl</sub><em>mno</em>
<em>pqr</em>
</pre>
Specifically, whenever lxml encounters a <sub> tag, it goes batty and loses the "tail" text (here, the newline) that follows that sub element, even though the sub element is wrapped in a pre element, where whitespace is supposed to be preserved as written.
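For anyone trying to narrow this down, here is a minimal sketch of what the "tail" values ought to look like for this markup. It assumes only the ElementTree data model that lxml also follows, and it uses the standard-library parser as a reference, since the test markup happens to be well-formed XML. The names MARKUP and describe_tails are hypothetical helpers made up for this illustration, not part of lxml.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question, minus the surrounding blank lines.
# It is well-formed XML, so the standard-library parser can read it as-is.
MARKUP = (
    "<pre>\n"
    "<em>abc</em>\n"
    "<em>def</em>\n"
    "<sub>ghi</sub>\n"
    "<sub>jkl</sub>\n"
    "<em>mno</em>\n"
    "<em>pqr</em>\n"
    "</pre>"
)


def describe_tails(root):
    """Return a (tag, text, tail) tuple for each direct child of root.

    Hypothetical helper: it relies only on the .tag/.text/.tail attributes
    and child iteration that ElementTree-style elements provide.
    """
    return [(child.tag, child.text, child.tail) for child in root]


root = ET.fromstring(MARKUP)
tails = describe_tails(root)
for tag, text, tail in tails:
    # repr() makes the newline in each tail visible
    print(tag, repr(text), repr(tail))
```

With the reference parser, every child, including both sub elements, carries a single newline as its tail. Calling the same helper on the doc object from the question (assuming html.fromstring returned the pre element itself) should show whether the newline is already missing after parsing or only disappears during serialization.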