I'm trying to use the pattern described in the "event-driven parsing" section of the lxml
tutorial.
In my code I'm calling a function that can run recursively on elements using the iterchildren()
method. I'll just use two nested loops for illustration here.
This works as expected:
    from StringIO import StringIO  # Python 2
    from lxml import etree

    xml = StringIO("<root><a><b>data</b><c><d/></c></a><a><z/></a></root>")
    for ev, elem in etree.iterparse(xml):
        if elem.tag == 'a':
            for c in elem.iterchildren():
                for gc in c.iterchildren():
                    print gc
The output is <Element d at 0x2df49b0>.
But if I add .clear() at the end:
    for ev, elem in etree.iterparse(xml):
        if elem.tag == 'a':
            for c in elem.iterchildren():
                for gc in c.iterchildren():
                    print gc
        elem.clear()
then nothing is printed. Why does this happen, and how can I work around it?
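To make the problem concrete, here is a minimal sketch that lists the order in which iterparse hands over elements. It assumes the default events=('end',) setting and Python 3 (hence BytesIO rather than StringIO). The helper name end_event_order is made up for illustration, and the fallback to the standard library's ElementTree, which offers the same iterparse interface, is only there so the snippet runs without lxml.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in here,
    # since it provides the same iterparse() interface.
    import xml.etree.ElementTree as etree

XML = b"<root><a><b>data</b><c><d/></c></a><a><z/></a></root>"


def end_event_order(data):
    """Return the tags in the order iterparse yields them (default: 'end' events)."""
    return [elem.tag for _event, elem in etree.iterparse(BytesIO(data))]


print(end_event_order(XML))
```

This should print ['b', 'd', 'c', 'a', 'z', 'a', 'root']: each element is yielded when its closing tag is seen, so d and c come out of the loop before the a that contains them.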
Notes:
- I can skip iterchildren and do for c in elem or for c in list(elem), with the same effect.
- I need to use an iterative approach to keep the memory usage low.
In the real use case, I am doing an element lookup using an attribute:
    if elem.attrib.get('id') == elem_id:
        return _get_info(elem)
I would like an explanation of how clear() manages to erase the inner elements before they are processed, and how to keep them in memory while they are needed for processing their ancestors.
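One possible direction, offered as a sketch rather than a definitive answer: call clear() only on the element whose subtree has just been fully processed, instead of on every element the loop sees. The code below assumes Python 3 and the default 'end' events. The function name grandchildren_of is hypothetical, and the stdlib fallback import is only there so the snippet runs without lxml.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree as a stand-in with the same iterparse() API.
    import xml.etree.ElementTree as etree

XML = b"<root><a><b>data</b><c><d/></c></a><a><z/></a></root>"


def grandchildren_of(data, tag="a"):
    """Collect grandchild tags of every <tag> element, clearing only finished subtrees."""
    found = []
    for _event, elem in etree.iterparse(BytesIO(data)):
        if elem.tag == tag:
            # Plain iteration over an element yields its children.
            for child in elem:
                for grandchild in child:
                    found.append(grandchild.tag)
            # The whole subtree has been handled, so it is safe to drop it now.
            elem.clear()
    return found


print(grandchildren_of(XML))
```

With the sample document this should print ['d']. Note that the cleared a elements stay attached to root as empty shells, so for very large inputs some extra cleanup of already-processed siblings may still be needed.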