i am trying to parse an xml like
<document>
<pages>
<page>
<paragraph>XBV</paragraph>
<paragraph>GHF</paragraph>
</page>
<page>
<paragraph>ash</paragraph>
<paragraph>lplp</paragraph>
</page>
</pages>
</document>
and here is my code
import xml.etree.ElementTree as ET
tree = ET.parse("../../xml/test.xml")
root = tree.getroot()
path="./pages/page/paragraph[text()='GHF']"
print root.findall(path)
but i get an error
print root.findall(path)
File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
return ElementPath.findall(self, path, namespaces)
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
return list(iterfind(elem, path, namespaces))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
selector.append(ops[token[0]](next, token))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
what is wrong with my xpath?
Follow up
Thanks falsetru, your solution worked. I have a follow up. Now, i want to get all the paragraph elements that come before the paragraph with text GHF
. So in this case i only need the XBV
element. I want to ignore the ash
and lplp
. i guess one way to do this would be
result = []
for para in root.findall('./pages/page/'):
t = para.text.encode("utf-8", "ignore")
if t == "GHF":
break
else:
result.append(para)
but is there a better way to do this?