
I am using lxml's iterparse to parse a rather large XML file. At a certain point an out-of-memory exception is thrown. I am aware of similar questions, and I know that iterparse builds a tree which you should normally clear with element.clear() once you no longer need it.

My code looks like this (shortened):

for  event,element in context :
    if element.tag == xmlns + 'initialized':        
        attributes = element.findall(xmlns+'attribute')         
        heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
        characteristics['max_heap_size_MB'] = bytes_to_MB(int(heapsize, 16))

    #clear up the built tree to avoid mem alloc fails
    element.clear()
del context

This works if I comment out element.clear(). If I call element.clear(), I get KeyErrors like this:

Traceback (most recent call last):
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 289, in <module>
    main()
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 277, in main
    join_characteristics_and_score(logpath, benchmarkscores)
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 140, in join_characteristics_and_score
    parsed_verbose_xml  = parse_xml(verbose)
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in parse_xml
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in <lambda>
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "lxml.etree.pyx", line 2272, in lxml.etree._Attrib.__getitem__ (src\lxml\lxml.etree.c:54751)
KeyError: 'name'

Without element.clear(), printing the elements' attrib shows regular dicts containing the expected values. With element.clear(), those dicts are empty.

EDIT

A minimal runnable Python program illustrating the problem:

#!/usr/bin/python

from lxml import etree
from pprint import pprint

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    xmlns = "{http://www.ibm.com/j9/verbosegc}"

    if elem.tag == xmlns + "gc-start":
        memelements = elem.findall('.//root:mem', namespaces={'root': xmlns[1:-1]})
        pprint(memelements)

if __name__ == '__main__':
    with open('small.xml', "r+") as xmlf:
        context = etree.iterparse(xmlf)
        fast_iter(context, process_element)

The content of the XML file is as follows:

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<gc-start id="5" type="scavenge" contextid="4" timestamp="2013-06-14T15:48:46.815">
  <mem-info id="6" free="3048240" total="4194304" percent="72">
    <mem type="nursery" free="0" total="1048576" percent="0">
      <mem type="allocate" free="0" total="524288" percent="0" />
      <mem type="survivor" free="0" total="524288" percent="0" />
    </mem>
    <mem type="tenure" free="3048240" total="3145728" percent="96">
      <mem type="soa" free="2891568" total="2989056" percent="96" />
      <mem type="loa" free="156672" total="156672" percent="100" />
    </mem>
    <remembered-set count="1593" />
  </mem-info>
</gc-start>
</verbosegc>
Nicolas
  • Why not use `element.findall('{%s}attribute' % xmlns)` instead? No need to iterate over all subelements. – Martijn Pieters May 23 '13 at 21:20
  • @blender: i get a keyerror when trying to access certain attributes by doing: child.attrib['key'] . without clearing this works – Nicolas May 23 '13 at 21:23
  • @MartijnPieters is there a comma missing after attribute' ? – Nicolas May 23 '13 at 21:24
  • @Nicolas: No, there is not. – Martijn Pieters May 23 '13 at 21:24
  • @Nicolas: Please add a copy of the traceback generated when it "breaks" to your question (with each line indented by 4 spaces or as code for readability). – martineau May 24 '13 at 01:15
  • @martineau I've added the stacktrace to my original question. – Nicolas May 24 '13 at 19:59
  • I've also changed the code as @MartijnPieters suggested but the result is still the same. – Nicolas May 24 '13 at 19:59
  • @Nicolas: Because the traceback doesn't relate to the shortened code you've posted it is of limited value. However my guess from what is shown is that the `element.clear()` is getting rid of things you're still referring to which were retrieved from the element using the `attributes = element.findall(xmlns+'attribute')`. If you must do the `clear()`, you may have to first make copies of what you want to keep. – martineau May 24 '13 at 20:39
  • how would that work? I don't really get the order of execution. I would expect the if statement to be true at one point, I list all the children I need, I extract the relevant information and only after that is the tree cleared. Obviously this isn't the case and it is cleared before or while finding the children. I have no real idea how to avoid that. – Nicolas May 25 '13 at 16:46
  • @martineau any idea how i could save relevant information if the findall function just returns an empty dict when called? – Nicolas May 27 '13 at 17:11
  • @Nicolas: If `findall()` returns an empty dict, there's nothing to save. More importantly, your code shouldn't execute the following line with the `filter()` call in it (which is where the `KeyError: 'name'` is occurring). – martineau May 27 '13 at 20:40
  • @martineau But it only returns an empty dict when calling element.clear() The filter is only executed when theoretically there should be a filled dict available. – Nicolas May 27 '13 at 22:24
  • @Nicolas: Try wrapping the `heapsize = ... etc` with an `if attributes:` and see what happens. – martineau May 27 '13 at 22:29
  • @martineau nothing will happen. the dicts returned are always empty. – Nicolas May 27 '13 at 22:39

1 Answer


Liza Daly has written a great article about processing large XML using lxml. Try the fast_iter code presented there:

import lxml.etree as ET
import pprint


def fast_iter(context, func, *args, **kwargs):
    """
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (Liza Daly)
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        # (ancestor loop added by unutbu)
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem, namespaces):
    memelements = elem.findall('.//root:mem', namespaces=namespaces)
    pprint.pprint(memelements)

if __name__ == '__main__':
    xmlns = "http://www.ibm.com/j9/verbosegc"
    namespaces = {'root': xmlns}
    # Binary mode: on Python 3, lxml's iterparse needs a file object that returns bytes
    with open('small.xml', 'rb') as xmlf:
        context = ET.iterparse(xmlf, events=('end', ),
                               tag='{{{}}}gc-start'.format(xmlns))
        fast_iter(context, process_element, namespaces)
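
Why the `tag` filter matters can be seen in a small sketch. It is not part of the original post: the trimmed XML, the function names, and the `clear_everything` flag are illustrative assumptions, and the import falls back to the standard library's `xml.etree.ElementTree` (whose `iterparse` fires `end` events in the same order) so the snippet also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same 'end' event order
    import xml.etree.ElementTree as etree

NS = "{http://www.ibm.com/j9/verbosegc}"

# Trimmed-down version of the sample file from the question
XML = b"""<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<gc-start id="5">
  <mem-info id="6">
    <mem type="nursery"/>
    <mem type="tenure"/>
  </mem-info>
</gc-start>
</verbosegc>"""


def end_event_order(xml_bytes):
    """Return local tag names in the order their 'end' events fire."""
    events = etree.iterparse(io.BytesIO(xml_bytes), events=("end",))
    return [elem.tag.replace(NS, "") for _, elem in events]


def mem_types_seen_at_gc_start(xml_bytes, clear_everything):
    """Collect the 'type' attributes still visible when gc-start ends."""
    seen = []
    for _, elem in etree.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == NS + "gc-start":
            seen = [m.get("type") for m in elem.iter(NS + "mem")]
        # Clearing every element wipes the children before the parent is seen
        if clear_everything or elem.tag == NS + "gc-start":
            elem.clear()
    return seen


print(end_event_order(XML))
print(mem_types_seen_at_gc_start(XML, clear_everything=True))
print(mem_types_seen_at_gc_start(XML, clear_everything=False))
```

The children's `end` events arrive before their parent's, so clearing on every event leaves nothing to find by the time `gc-start` is handled. Clearing only the element you actually process keeps its subtree intact until you are done with it.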
Nate Anderson
unutbu
  • I hoped there is a solution which doesn't really require rewriting my code – Nicolas May 23 '13 at 21:40
  • I tried rewriting the code but I encounter basically the same problem: if i try to do a element.findall inside the process() method it returns an empty list, while without clearing i get the desired children. – Nicolas Jul 04 '13 at 18:01
  • Please post a runnable example, with sample XML which demonstrates the problem. – unutbu Jul 04 '13 at 18:34
  • thanks unutbu. I added a runnable example to my original question. when commenting out elem.clear() and del elem.getparent()[0] this works. perhaps something with the event when this is started? it seems to work when i change the iterparse event to start instead of end. but this fails for larger files, python just crashes. if the event is start and end, XMLSyntaxError is thrown with this error: http://stackoverflow.com/questions/6511408/lxml-xmlsyntaxerror-namespace-default-prefix-was-not-found – Nicolas Jul 04 '13 at 19:02
  • @Nicolas: The `fast_iter` function saves memory by deleting elements after they have been processed. Without a `tag` parameter in `iterparse`, `fast_iter` processes *all* tags. The `mem` tags, in particular, get processed and cleared *before* you get to the `gc-start` tag. That is why you were seeing no items returned by `elem.findall`. The solution is to include a `tag` parameter, and use `events=('end', )` so all the `mem` tags inside the `gc-start` tag will have been parsed before you call `process_element` on -- and only on -- the `gc-start` tag. I've edited the post to show what I mean. – unutbu Jul 04 '13 at 21:15
  • Working link to article https://web.archive.org/web/20210309115224/http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ – nijave Oct 31 '22 at 00:17
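
Applied to the loop from the question, the same idea might look like the sketch below. Several things here are assumptions and not from the original post: the sample XML (including the `gcPolicy` attribute and the `0x20000000` value), the body of `bytes_to_MB`, and the name `parse_max_heap_size`. The import again falls back to the standard library when lxml is unavailable.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for environments without lxml
    import xml.etree.ElementTree as etree

XMLNS = "{http://www.ibm.com/j9/verbosegc}"

# Hypothetical sample; the question does not show the real <initialized> block
SAMPLE = b"""<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<initialized id="1">
  <attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
  <attribute name="maxHeapSize" value="0x20000000" />
</initialized>
</verbosegc>"""


def bytes_to_MB(n):
    """Hypothetical stand-in for the asker's helper of the same name."""
    return n // (1024 * 1024)


def parse_max_heap_size(source):
    characteristics = {}
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag != XMLNS + "initialized":
            # Do not clear here: this may be a child that its parent still needs
            continue
        for attribute in element.findall(XMLNS + "attribute"):
            if attribute.get("name") == "maxHeapSize":
                heapsize = attribute.get("value")
                characteristics["max_heap_size_MB"] = bytes_to_MB(int(heapsize, 16))
        # The whole subtree has been read, so it is now safe to free it
        element.clear()
    return characteristics


print(parse_max_heap_size(io.BytesIO(SAMPLE)))
```

Note that this sketch only frees `initialized` subtrees. Other top-level records (such as `gc-start`) would still accumulate under the root, so for a really large file you would clear those as well once they end, or use lxml's `tag` argument together with `fast_iter` as in the answer.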