Processing large XML files. Only the attributes of the root's children are relevant

Question

I'm new to XML and Python, and I hope I have phrased my problem correctly:

I have XML files that are one gigabyte in size. The files look like this:

<test name="LongTestname" result="PASS">
    <step ID="0" step="NameOfStep1" result="PASS">
        Stuff I don't care about
    </step>
    <step ID="1" step="NameOfStep2" result="PASS">
        Stuff I don't care about
    </step>
</test>

For fast analysis, I want to get the name and the result of each step; the steps are the children of the root element. "Stuff I don't care about" stands for lots of nested elements.

I have already tried the following:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlLocation)  # builds the whole tree in memory
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)

Here I get a memory error because the files are too big.

Then I tried:

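import xml.etree.ElementTree as ET

stepAndResult = []
for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
    # the attributes are already available on the "start" event
    if elem.tag == "step" and event == "start":
        stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
    elem.clear()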

This works but is really slow. I guess it iterates through all elements and this takes a very long time.

Then I found a solution looking like this:

try:
    tree = ET.iterparse(pathToSteps, events=("start", "end"))
    _, root = next(tree)  # the first event is the "start" of the root element
    print('ROOT:', root.tag)
except Exception:
    print("ERROR: Unable to open and parse file !!!")

for child in root:
    print(child.attrib)

But this prints only the attributes of the first step.

Is there a way to speed up the working solution? Since I'm pretty new to this stuff I would appreciate a complete example or a reference where I can figure it out by myself with an example.

score 0 · Answer 1 · answered Jul 22 '21 at 13:30

Without knowing the specifics of your setup, it is hard to say what the 'fastest possible' would be, or how much of the delay comes from parsing the file. The first thing I would do is time the run so that you have a baseline. Then I would write a simple Python program that does nothing but read the file from disk (no XML parsing). If the two times are close, the XML parsing isn't the issue; reading the file from disk is. Nothing in an XML document tells you in advance where an element ends, so you cannot skip the I/O for the parts you don't need: you still have to read the file linearly. Other than potentially using a different (non-interpreted) programming language, there may not be much you can do.

If the XML parsing itself does account for a significant part of the time, you could try pre-processing the file into a smaller one. Since the format of your files is very static, you could read the file and, using a regex, copy it to a different file until you reach an opening <step> tag, then throw away the data up to the closing </step> tag (or the final </test> tag). That results in a valid but hopefully much smaller XML file. The key is to do the 'parsing' yourself instead of having a full parser make sense of the entire document, which could be much faster because your format is simple. You could then run your original program on this output, which will not 'see' any of the extraneous tags. Of course, this breaks if you actually have nested <step> tags; in that case you likely need a real XML parser to work out where the first-level elements start and stop.

score 0 · Accepted Answer · answered Jul 22 '21 at 14:47

I think you're on the right track with iterparse().

Maybe try specifying the step element name in the tag argument and only processing "start" events...

from lxml import etree

for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
    print(elem.attrib)
    elem.clear()

EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.

Daniel Haley

Giving `lxml` a try may be worth the effort... If you are on Python 3.3 or newer, `lxml` and the standard `ElementTree` module tend to perform similarly, though one or the other can be noticeably faster in specific cases; on older versions `ElementTree` is definitely slower, although you can fall back on `cElementTree` to get comparable performance. – gimix Jul 23 '21 at 16:22
@Daniel Haley, thank you for your answer. I'm trying to get lxml running on a company PC with restricted rights... I hope I can get it running this week. – JackZ Jul 26 '21 at 09:52
