1

I would like to read quite big XML as a stream. But could not find any way to use my old XPathes to find elements. Previously files were of moderate size, so in was enough to:

all_elements = []
for xpath in list_of_xpathes:
    all_elements.append(etree.parse(file).getroot().findall(xpath))

Now I am struggling with iterparse. Ideally the solution would be to compare path of current element with desired xpath:

import lxml.etree as et

xml_file = r"my.xml" # quite big xml, that i should read
xml_paths = ['/some/arbitrary/xpath', '/another/xpath']

all_elements = []
iter = et.iterparse(xml_file, events = ('end',))
for event, element in iter:
    for xpath in xml_paths:
        if element_complies_with_xpath(element, xpath):
            all_elements.append(element)
            break

How is it possible to implement element_complies_with_xpath function using lxml?

Alexandr Crit
  • 111
  • 1
  • 14
  • AFAIK - you cannot compare XPath (which requires reading _entire_ document in memory) with `iterparse` that iteratively reads current tags and ideally discards it. You may need to break apart your hopefully simple XPath into a parent-child relationship and conditionally check `tag` names as you walk down tree. May not work for complex XPath. – Parfait Jul 05 '22 at 19:04
  • Sadly, those XPathes are external to my code. If I would to break them up, i’d have to duplicate xml find algorithm to tokenize path and search for appropriate element to match. I tried bypassing this problem with xpath editing like: element.getroot().xpath(element.gettree().getpath(element)+” and “+my xpath). Or search ancestors till this path matches. So that returned elements would match both current element and desired xpath. But I could not construct valid xpath expression. – Alexandr Crit Jul 05 '22 at 19:23

1 Answers1

0

If first part of the xpath can be extracted then the rest could be tested as follows. Instead of a list of strings, a dict of <first element name>: <rest of the xpath> could be used. Parent element could be used as dict key also.
Full xpath: /some/arbitrary/xpath
dict : {'some': './arbitrary/xpath'}

import lxml.etree as et

def element_complies_with_xpath(element, xpath):
    children = element.xpath(xpath)
    print([ "child:" + x.tag for x in children])
    return len(children) > 0

xml_file = r"/home/lmc/tmp/test.xml" # quite big xml, that i should read
xml_paths = [{'membership': './users/user'}, {'entry':'author/name'}]

all_elements = []
iter1 = et.iterparse(xml_file, events = ('end',))

for event, element in iter1:
    for d in xml_paths:
        if element.tag in d and element_complies_with_xpath(element, d[element.tag]):
            all_elements.append(element)
            break

print([x.tag for x in all_elements])

count() xpath function could be used also

def element_complies_with_xpath(element, xpath):
    children = element.xpath(xpath)
    print( f"child exist: {children}")
    return children

xml_file = r"/home/luis/tmp/test.xml" # quite big xml, that i should read
xml_paths = [{'membership': 'count(./users/user) > 0'}, {'entry':'count(author/name) > 0'}]
LMC
  • 10,453
  • 2
  • 27
  • 52
  • I tried this approach. It do have some problems. For example: ./[attr=value], does not fit to your suggestions. You have to make many assumptions during parsing. – Alexandr Crit Jul 06 '22 at 05:20
  • You shoud use parent element as key in those cases `"parent": "./[attr=value]"` or root element. – LMC Jul 06 '22 at 13:37