Python XPath SyntaxError: invalid predicate

Question

I am trying to parse an XML document like this:

<document>
    <pages>

    <page>   
       <paragraph>XBV</paragraph>

       <paragraph>GHF</paragraph>
    </page>

    <page>
       <paragraph>ash</paragraph>

       <paragraph>lplp</paragraph>
    </page>

    </pages>
</document>

Here is my code:

import xml.etree.ElementTree as ET

tree = ET.parse("../../xml/test.xml")

root = tree.getroot()

path="./pages/page/paragraph[text()='GHF']"

print root.findall(path)

But I get an error:

print root.findall(path)
  File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
    return ElementPath.findall(self, path, namespaces)
  File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
    return list(iterfind(elem, path, namespaces))
  File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
    selector.append(ops[token[0]](next, token))
  File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
    raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate

What is wrong with my XPath?

Follow up

Thanks falsetru, your solution worked. I have a follow-up: now I want to get all the paragraph elements that come before the paragraph with the text GHF. In this case I only need the XBV element, and I want to ignore ash and lplp. I guess one way to do this would be:

result = []
# Walk the paragraphs in document order and stop at the first 'GHF'.
for para in root.findall('./pages/page/paragraph'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)

But is there a better way to do this?
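
The loop above can also be written with `itertools.takewhile`. This is only a minimal sketch using the standard library: it assumes Python 3, inlines the sample XML from the question instead of reading `test.xml`, and the names `XML` and `before_ghf` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<document>
    <pages>
    <page>
       <paragraph>XBV</paragraph>
       <paragraph>GHF</paragraph>
    </page>
    <page>
       <paragraph>ash</paragraph>
       <paragraph>lplp</paragraph>
    </page>
    </pages>
</document>"""

root = ET.fromstring(XML)

# iterfind yields the paragraphs lazily in document order; takewhile
# stops consuming them at the first paragraph whose text is 'GHF'.
paragraphs = root.iterfind("./pages/page/paragraph")
before_ghf = list(takewhile(lambda p: p.text != "GHF", paragraphs))

print([p.text for p in before_ghf])  # ['XBV']
```

Note that if no paragraph contains GHF, this returns every paragraph, which is the same behaviour as the explicit loop.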

score 21 · Accepted Answer · edited Jan 17 '23 at 07:43

ElementTree's XPath support is limited; in particular, it does not accept text() in a predicate. Use another library such as lxml:

import lxml.etree
root = lxml.etree.parse('test.xml')

path = "./pages/page/paragraph[text()='GHF']"
print(root.xpath(path))

answered Nov 20 '15 at 15:59 by falsetru · edited Jan 17 '23 at 07:43 by Jean-Francois T.

thanks man! can i also do something like text.contains("something") and text.notContains("something")? – AbtPst Nov 20 '15 at 16:10
@AbtPst, You can: `path="./pages/page/paragraph[contains(text(),'something')]" ` / `path="./pages/page/paragraph[not(contains(text(),'something'))]"` – falsetru Nov 20 '15 at 16:24
No, you cannot with ElementTree's `findall` (see http://stackoverflow.com/questions/2637760/how-do-i-match-contents-of-an-element-in-xpath-lxml), since `def prepare_predicate(next, token)` fails – Learner Nov 20 '15 at 16:25
thanks man! that worked. it gives me all the elements i need. just a followup. now what if i want all the paragraphs before i see the 'GHF' paragraph? Once i see the paragraph with the text 'GHF', i want to ignore everything else that comes after it. can i do that? – AbtPst Nov 20 '15 at 16:30
@AbtPst, Sorry, I don't get it. Please post another question. – falsetru Nov 20 '15 at 16:37
thanks, i have updated the question. please take a look – AbtPst Nov 20 '15 at 16:44
@AbtPst, Please post a separated question instead of updating the current question. – falsetru Nov 20 '15 at 23:58

score 6 · Answer 2 · answered Dec 21 '17 at 11:45

As @falsetru mentioned, ElementTree doesn't support the text() predicate, but it does support matching a child element by its text. So in this example it is possible to search for a page that has a paragraph with specific text, using the path ./pages/page[paragraph='GHF']. The problem here is that a page contains multiple paragraph tags, so one would still have to iterate over them to find the specific paragraph. In my case, I needed to find the version of a dependency in a Maven pom.xml, where there is only a single version child, so the following worked:

In [1]: import xml.etree.ElementTree as ET

In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}

In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'
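
For the XML in the question, the iteration mentioned above might look like the following. This is only a sketch using the standard library: the sample document is inlined rather than read from `test.xml`, and the variable names are made up for illustration. The last lookup relies on the `[.='text']` predicate, which ElementTree only accepts from Python 3.7 onwards.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<document>
    <pages>
    <page>
       <paragraph>XBV</paragraph>
       <paragraph>GHF</paragraph>
    </page>
    <page>
       <paragraph>ash</paragraph>
       <paragraph>lplp</paragraph>
    </page>
    </pages>
</document>"""

root = ET.fromstring(XML)

# Step 1: ElementTree can select pages that have a paragraph child
# whose text equals 'GHF'.
pages = root.findall("./pages/page[paragraph='GHF']")

# Step 2: iterate over those pages' paragraphs and filter in Python.
matches = [
    p
    for page in pages
    for p in page.findall("paragraph")
    if p.text == "GHF"
]
print([p.text for p in matches])  # ['GHF']

# Python 3.7+ only: [.='text'] compares the element's own text content,
# so the filter can be expressed in the path itself.
dot_matches = root.findall("./pages/page/paragraph[.='GHF']")
print([p.text for p in dot_matches])  # ['GHF']
```

The same Python-side filtering also works for substring tests, for example `"something" in (p.text or "")`, since ElementTree has no `contains()` function.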
