Why is text of HTML node empty with HTMLParser?

Question

In the following example I am expecting to get Foo for the <h2> text:

from io import StringIO
from html5lib import HTMLParser

fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
''')

etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]

h2.text

Unfortunately I get ''. Why?

Strangely, Foo is in the text:

>>> list(h2.itertext())
['1. ', 'Foo', '¶']

>>> list(h2)  # h2.getchildren() was removed in Python 3.9
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]

>>> [node.text for node in h2]
['1. ', '¶']

So where is Foo?

Why are you expecting "Foo" when `h2` contains no text? The child element `span`, on the other hand, does. — WiseDev, Aug 06 '19 at 12:17
@BlueRineS `<span>` is closed before `Foo`, so `Foo` is not in `<span>`... — nowox, Aug 06 '19 at 12:18
@BlueRineS I have edited my question; I do not find `Foo` in `span`. Yes, I have searched the manual. — nowox, Aug 06 '19 at 12:20

Matt L. · Accepted Answer · 2019-08-06T12:32:51.210

2

I think you are one level too shallow in the tree. Try this:

from io import StringIO
from html5lib import HTMLParser

fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
''')

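etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
# "Foo" is stored as the tail of the first child of <h2> (the <span>),
# surrounded by the whitespace from the source markup
print(repr(etree.findall('.//h2')[0][0].tail))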

More generally, to crawl all text and tail, try a loop like this:

for u in etree.findall('.//h2')[0]:
    print(u.text, u.tail)
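
If you only want the text that sits directly inside the `<h2>` (and not the text of its children), the same idea can be wrapped in a small helper. This is a minimal sketch, not part of html5lib: `own_text` is a hypothetical name, and the snippet uses the standard library's `xml.etree.ElementTree` on a well-formed fragment so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Join the text nodes that are direct children of `element`.

    Hypothetical helper: element.text covers the text before the first
    child, and each child's tail covers the text that follows that child.
    """
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


# The <h2> from the question, parsed as XML for illustration only.
h2 = ET.fromstring('''<h2>
    <span class="section-number">1. </span>
    Foo
    <a class="headerlink" href="#foo">¶</a>
</h2>''')

print(own_text(h2).strip())  # Foo
```

The `.strip()` is needed because the indentation and newlines around `Foo` are part of the tail as well.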

edited Aug 06 '19 at 12:32

answered Aug 06 '19 at 12:21

Matt L.


OK, I understand, but I don't understand... :) It seems `Foo` is part of `span.tail` even though it is outside of `span`. – nowox Aug 06 '19 at 12:26
@nowox - It does seem like a parsing error. If you parse the html with lxml, you find `Foo` inside the `h2` text, as it should be. – Jack Fleeting Aug 06 '19 at 12:46
@JackFleeting Do you mean `HTMLParser` is buggy? – nowox Aug 06 '19 at 13:20
@nowox - That may well be the case, though I don't know enough about it to be definitive. I can say that lxml's default parser works as expected. – Jack Fleeting Aug 06 '19 at 13:21
@JackFleeting How would you modify my example to use lxml? – nowox Aug 06 '19 at 13:25
@nowox - sure; see answer below (or above, no idea where SO is going to stick it...) – Jack Fleeting Aug 06 '19 at 13:29
@nowox I think you've misunderstood how `ElementTree` represents data: `el.text` is the first child of the element if that child is a text node; `el.tail` is the sibling immediately following the element if that sibling is a text node. See also, e.g., [this Q](https://stackoverflow.com/questions/37062825/traversing-tei-in-python-3-text-comes-up-empty-for-some-entities/37063050) and [the Python documentation](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.text). – gsnedders Aug 21 '19 at 13:00
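
To make the `text`/`tail` split concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than html5lib. It assumes the heading is written on one line as well-formed XML, so no whitespace-only text nodes get in the way; the element API should be the same one that html5lib's default tree builder exposes.

```python
import xml.etree.ElementTree as ET

# The <h2> from the question, without indentation, parsed as XML
# for illustration only.
h2 = ET.fromstring(
    '<h2>'
    '<span class="section-number">1. </span>'
    'Foo'
    '<a class="headerlink" href="#foo">¶</a>'
    '</h2>'
)

span, a = list(h2)

print(repr(h2.text))    # None: nothing precedes the first child
print(repr(span.text))  # '1. '
print(repr(span.tail))  # 'Foo': the text node that follows </span>
print(repr(a.text))     # '¶'
print(repr(a.tail))     # None: nothing follows </a>
```

In the original, indented markup `h2.text` is not `None` but the whitespace between `<h2>` and `<span>`, and `span.tail` carries the newlines and indentation around `Foo`.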

score 0 · Answer 2 · answered Aug 06 '19 at 13:28

0

Using lxml:

fp2 = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
'''

import lxml.html
tree = lxml.html.fromstring(fp2)

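for item in tree.xpath('//h2'):
    # text_content() concatenates every text node under <h2>
    target = item.text_content().strip()
    # the second line of that text is the bare "Foo" between <span> and <a>
    print(target.split('\n')[1].strip())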

Output:

Foo

answered Aug 06 '19 at 13:28

Jack Fleeting


Thanks :) So I wasn't using the same library then (`from lxml import etree`) – nowox Aug 06 '19 at 13:29
@nowox - no; they just sound the same :) – Jack Fleeting Aug 06 '19 at 13:30
Note that the tree `lxml` produces here is no different to the one `html5lib` produces; the "Foo" becomes `item[0].tail` in the above code, so this doesn't answer why it's not in `item.text`. – gsnedders Aug 21 '19 at 12:56
@gsnedders - I have no idea what you're talking about; OP asked for an `lxml` solution, not for an answer to what you think is the question. – Jack Fleeting Aug 21 '19 at 13:46
OP asked why it wasn't in `h2.text` (which you don't answer), and "where is Foo?" (which, okay, you get from `h2.text_content()`, but it's unsurprising it appears in the sum of all of the text node descendants). You claimed in a comment this was a parsing error with html5lib, despite it appearing in the same place in lxml. – gsnedders Aug 21 '19 at 16:48
@gsnedders - No, op just asked "@JackFleeting How would you modify my example to use lxml?", which is what I did - but whatever. – Jack Fleeting Aug 21 '19 at 18:02
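
For reference, the reason `Foo` shows up in `itertext()` (and in lxml's `text_content()`) is that both walk every text node under the element, tails included, in document order. A minimal sketch with the standard library's `xml.etree.ElementTree`, assuming a whitespace-free version of the heading:

```python
import xml.etree.ElementTree as ET

# Whitespace-free version of the question's heading, for illustration only.
h2 = ET.fromstring('<h2><span>1. </span>Foo<a href="#foo">¶</a></h2>')

# itertext() yields each element's text, then its children's text,
# then each child's tail, in document order.
all_text = list(h2.itertext())
print(all_text)  # ['1. ', 'Foo', '¶']

# The middle item is not h2.text; it is the tail of the <span>.
print(h2.text, repr(h2[0].tail))  # None 'Foo'
```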
