
I need to process an approximately 8 GB XML file. The (simplified) file structure is similar to the one below:

<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ....and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ....
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
        ..... hundreds of thousands of items, some are quite large.
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
            ...hundreds of thousands of RecordType2 elements, slightly different from RecordItems in RecordType1 
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>

I need to extract some of the sub-elements in RecordType1 and RecordType2 elements. There are conditions to determine which record items need to be processed and which fields need to be extracted. The individual RecordItems do not exceed 120k (some have extensive text data, which I do not need).

Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name types and name components to pick.

from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            root.clear()
    return all_records

I have experimented with the number of records. The code nicely processes 100k RecordItems (Type1 only; it just takes too long to get to Type2) in approximately one minute. Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError in ElementTree.py. So I am guessing no memory is released despite the root.clear() statement.

An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that. From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task. I am new to XML and to the relevant Python libraries, so a detailed explanation would be much appreciated!

Raits
  • Instead of building a list `all_records` containing all the matched records, can you just perform your processing at the point where you currently have `all_records += record_contents`? Building that list is probably what's eating your memory. – larsks Aug 13 '21 at 17:55
  • Hi, all_records contains only processed records, and a single item in all_records is less than 100 bytes. I am building a single list because ultimately I need to export it to a .CSV file. – Raits Aug 13 '21 at 18:03
  • If you *don't* build the list, do you still run out of memory? If the problem persists, obviously I'm on the wrong track, but it seems worth a try. If you're just outputting a CSV file, you can write out records iteratively as you read them in -- you don't need to build a list and write it all out at once. – larsks Aug 13 '21 at 18:11
  • I ran my code without building the big list, and got the same `ElementTree.py", line 1224, in iterator data = source.read(16 * 1024) MemoryError`. I was able to process about 960k records. In addition, the program froze at random intervals for 2 to 20 minutes each time. I also tried to process 'RecordType2' (which comes after RecordType1), and those were never reached (MemoryError again). Unless it is some bug in iterparse itself, it must be something wrong with how I iterate through the XML file. – Raits Aug 16 '21 at 06:25
  • Darn, guess I was on the wrong track. Sorry about that! – larsks Aug 16 '21 at 13:21

1 Answer


Iterating over a huge XML file is always painful.

I'll go over the whole process from start to finish, suggesting best practices for keeping memory usage low while maximizing parsing speed.

First, there is no need to store ET.iterparse in a variable. Just iterate over it directly:

for event, elem in ET.iterparse(xml_file, events=("start", "end")):

The iterator yields each element as soon as it has been parsed, so you never have to wait for the whole document. One caveat: finished elements stay attached to the tree the parser builds, so you still need to clear each record once you are done with it (the elem.clear() call below). With that in place you don't need the root/next(context) bookkeeping, and you can process XML files as large as your disk allows.

Your code should look like:

from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    all_records = []
    for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
        if event == 'end' and elem.tag == record_category:
            if elem.attrib['action'] != 'del':
                record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
                if record_contents:
                    all_records += record_contents
            elem.clear()  # release the finished record's subtree; this is what actually frees memory
    return all_records

Also, please think carefully about the reason you need to store the whole all_records list. If it's only for writing a CSV file at the end of the process, that reason isn't good enough: it will cause memory issues when scaling to even bigger XML files.

Make sure you write each new row to the CSV file as soon as that row is produced, turning the memory issue into a non-issue.
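For example, here is a minimal sketch of that streaming pattern, reusing the question's get_record helper and filter condition (csv_path is a hypothetical output parameter, and plain ElementTree is used since the cElementTree alias is deprecated):

import csv
from xml.etree import ElementTree as ET

def stream_records_to_csv(xml_file_path, record_category, name_types, name_components, csv_path):
    # open the output file once, then append rows as records are parsed
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
            if event == "end" and elem.tag == record_category:
                if elem.attrib.get("action") != "del":
                    # get_record is the question's helper, assumed to return
                    # a list of small row tuples
                    for row in get_record(elem, name_types=name_types,
                                          name_components=name_components,
                                          record_id=elem.attrib["id"]):
                        writer.writerow(row)  # written immediately, never accumulated
                elem.clear()  # free the record's subtree before moving on

Each row hits the disk as soon as it is extracted, so memory use stays flat no matter how many records the file contains.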

P.S.

If you need to keep information from tags that appear before your main tag, so that you can use it as you go down the XML file, just store it locally in new variables. This comes in handy whenever data later in the XML file refers back to a tag you know has already occurred.
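As a rough sketch of that idea, using the question's SomeElementList and RecordItem names (process_record is a placeholder for whatever per-record work you do):

from xml.etree import ElementTree as ET

def parse_with_lookup(xml_file_path):
    lookup = []  # holds data from tags that appear before the records
    for event, elem in ET.iterparse(xml_file_path, events=("end",)):
        if elem.tag == "Element":
            # children of SomeElementList arrive before any RecordItem,
            # so stash their text locally for later use
            lookup.append(elem.text)
            elem.clear()
        elif elem.tag == "RecordItem":
            # by the time records appear, 'lookup' already holds everything
            # from SomeElementList, so there is no need to re-read the file
            process_record(elem, lookup)  # placeholder for your record handling
            elem.clear()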

Pavel Gomon