
Suppose I have the following simplified nested HTML list:

<ol>
  <li>Item 1</li>
  <li>Item 2
    <ul>
      <li>Item 2 1</li>
    </ul>
  </li>
  <li>Item 3</li>
</ol>

and now I’d like to visit every text node while iterating over the list items:

for li in xml.xpath(".//li"):
    for t in li.xpath(".//text()"):
        print(t)

However, this prints Item 2 1 twice, because that text node is a descendant of two li nodes. I want to select only those text nodes whose nearest li ancestor is the current (context) list item, to avoid selecting text nodes in nested list items more than once. Something like

li.xpath(".//text[ancestor::li[1] == .]")

but that’s an invalid expression.

How do I do that? (This is using lxml, which builds on libxml2, which implements XPath 1.0.)

ErikMD
Jens

2 Answers

If I understand you correctly, this should work:

for t in xml.xpath('//li[text()]'):
    print(t.text.strip())

Output:

Item 1
Item 2
Item 2 1
Item 3
Jack Fleeting

First, it can be noted that the following XPath 1.0 expression:

.//text()

is shorthand for ./descendant-or-self::node()/child::text(): it selects every text node anywhere below the context node. (descendant-or-self is one of the thirteen XPath 1.0 axes.)

So if you only want the text nodes that are direct children of the current node, use the child axis: child::text(). Since child is the default axis, you can simply write text().

Relying on the example from your question:

#!/usr/bin/env python3
from lxml import etree
with open('./a.xml') as data:
    xml = etree.parse(data)
    for li in xml.xpath(".//li"):
        print(li.xpath("text()"))

will output

['Item 1']
['Item 2\n    ', '\n  ']
['Item 2 1']
['Item 3']
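
If an li may also contain inline markup, so that some of its text sits in descendant elements, text() alone would miss those nodes. One possible sketch (not taken from the answers here) compares li nesting depths instead: a descendant text node belongs to the current li exactly when it has as many li ancestors as the current li has li ancestors-or-self. The helper name own_text_nodes, the XPath variable $depth, and the sample markup with a <b> element are assumptions made for illustration.

```python
from lxml import etree

# Sample markup (an assumption): the question's list plus an inline <b> element.
HTML = """
<ol>
  <li>Item 1</li>
  <li>Item <b>2</b>
    <ul>
      <li>Item 2 1</li>
    </ul>
  </li>
  <li>Item 3</li>
</ol>
"""


def own_text_nodes(li):
    """Return the non-blank text under `li` whose nearest li ancestor is `li`."""
    # Number of li elements on the path from the root down to and including li.
    depth = li.xpath("count(ancestor-or-self::li)")
    # Keep descendant text nodes with the same li count: no nested li in between.
    nodes = li.xpath(".//text()[count(ancestor::li) = $depth]", depth=depth)
    # Drop whitespace-only nodes and trim the rest for readability.
    return [str(t).strip() for t in nodes if t.strip()]


root = etree.fromstring(HTML.strip())
result = [own_text_nodes(li) for li in root.xpath(".//li")]
for texts in result:
    print(texts)
```

Each text node is reported once, under its nearest li, and the text inside <b> is attributed to the second item rather than lost. The depth is computed per li in Python because plain XPath 1.0 has no way to refer back to the outer context node from inside a predicate.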
ErikMD
  • The `text()` would not work for `Some text` nested inside a child element of the `li`, as it would not select the text nodes of `li`’s descendants. – Jens Jan 30 '21 at 22:03
  • But that's what you wanted, wasn't it? – ErikMD Jan 30 '21 at 22:04
  • Given the outer loop will go into these nested `li` nodes… – ErikMD Jan 30 '21 at 22:04