In the following example I am expecting to get Foo for the <h2> text:

from io import StringIO
from html5lib import HTMLParser

fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
''')

etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]

h2.text

Unfortunately I get ''. Why?

Strangely, Foo is in the text:

>>> list(h2.itertext())
['1. ', 'Foo', '¶']

>>> list(h2)  # Element.getchildren() was removed in Python 3.9
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]

>>> [node.text for node in h2]
['1. ', '¶']

So where is Foo?

nowox

2 Answers


I think you are one level too shallow in the tree. Try this:

from io import StringIO
from html5lib import HTMLParser

fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
''')

etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
etree.findall('.//h2')[0][0].tail

More generally, to crawl all text and tail, try a loop like this:

for u in etree.findall('.//h2')[0]:
    print(u.text, u.tail)
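
To see why `Foo` shows up as a tail, here is a minimal sketch of the ElementTree data model: `el.text` holds only the text before the element's first child, and any text that follows a child's closing tag is stored on that child as `tail`. The snippet parses a trimmed copy of the `<h2>` with the standard library's `xml.etree.ElementTree` rather than html5lib (an assumption made to keep it dependency-free); the `<Element 'span' at 0x...>` output in the question suggests html5lib is building the same kind of elements here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <h2> from the question, parsed with the standard
# library (assumption: stands in for the html5lib-built tree).
h2 = ET.fromstring(
    '<h2>\n'
    '    <span class="section-number">1. </span>\n'
    '    Foo\n'
    '    <a class="headerlink" href="#foo">¶</a>\n'
    '</h2>'
)

span, anchor = list(h2)

# h2.text is only the text before the first child: whitespace in this case.
print(repr(h2.text))

# The text between </span> and <a> is attached to <span> as its tail.
print(repr(span.tail))
print(span.tail.strip())
```

So `Foo` is still a direct child text node of `<h2>` in the document; ElementTree just has no separate text-node object and hangs it off the preceding sibling element.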
Matt L.
  • OK, I understand, but I don't understand... :) It seems `Foo` is part of `span.tail` even though it is outside of `span`. – nowox Aug 06 '19 at 12:26
  • @nowox - It does seem like a parsing error. If you parse the HTML with lxml, you find `Foo` inside the `h2` text, as it should be. – Jack Fleeting Aug 06 '19 at 12:46
  • @JackFleeting Do you mean `HTMLParser` is buggy? – nowox Aug 06 '19 at 13:20
  • @nowox - That may well be the case, though I don't know enough about it to be definitive. I can say that lxml's default parser works as expected. – Jack Fleeting Aug 06 '19 at 13:21
  • @JackFleeting How would you modify my example to use lxml? – nowox Aug 06 '19 at 13:25
  • @nowox - sure; see answer below (or above, no idea where SO is going to stick it...) – Jack Fleeting Aug 06 '19 at 13:29
  • @nowox I think you've misunderstood how `ElementTree` represents data: `el.text` is the first child of the element if that child is a text node; `el.tail` is the following sibling of the element if that sibling is a text node. See also, e.g., [this Q](https://stackoverflow.com/questions/37062825/traversing-tei-in-python-3-text-comes-up-empty-for-some-entities/37063050) and [the Python documentation](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.text). – gsnedders Aug 21 '19 at 13:00

Using lxml:

fp2 = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
'''

import lxml.html
tree = lxml.html.fromstring(fp2)

for item in tree.xpath('//h2'):
    target = item.text_content().strip()
    print(target.split('\n')[1].strip())

Output:

Foo
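
Picking the second line of `text_content()` depends on how the source happens to be laid out. A less layout-sensitive sketch collects only the text nodes that are direct children of the element, using the `.text` and `.tail` attributes that both `xml.etree.ElementTree` and lxml elements expose. `own_text` is a hypothetical helper name, and the snippet parses with the standard library only to avoid a third-party dependency.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Return the non-empty text nodes that are direct children of element."""
    # Text before the first child lives in element.text ...
    parts = [element.text or '']
    # ... and text after each child lives in that child's tail.
    parts.extend(child.tail or '' for child in element)
    return [part.strip() for part in parts if part.strip()]


h2 = ET.fromstring(
    '<h2>\n'
    '  <span class="section-number">1. </span>\n'
    '  Foo\n'
    '  <a class="headerlink" href="#foo">¶</a>\n'
    '</h2>'
)

print(own_text(h2))  # ['Foo']
```

Text inside `<span>` and `<a>` is deliberately skipped, since those strings belong to the child elements rather than to `<h2>` itself.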

Jack Fleeting
  • Thanks :) So I wasn't using the same library then (`from lxml import etree`) – nowox Aug 06 '19 at 13:29
  • @nowox - no; they just sound the same :) – Jack Fleeting Aug 06 '19 at 13:30
  • Note that the tree `lxml` produces here is no different from the one `html5lib` produces; the `Foo` becomes `item[0].tail` in the above code, so this doesn't answer why it's not in `item.text`. – gsnedders Aug 21 '19 at 12:56
  • @gsnedders - I have no idea what you're talking about; OP asked for an `lxml` solution, not for an answer to what you think is the question. – Jack Fleeting Aug 21 '19 at 13:46
  • OP asked why it wasn't in `h2.text` (which you don't answer), and "where is Foo?" (which, okay, you get from `h2.text_content()`, but it's unsurprising it appears in the sum of all of the text node descendants). You claimed in a comment this was a parsing error with html5lib, despite it appearing in the same place in lxml. – gsnedders Aug 21 '19 at 16:48
  • @gsnedders - No, op just asked "@JackFleeting How would you modify my example to use lxml?", which is what I did - but whatever. – Jack Fleeting Aug 21 '19 at 18:02