I need to process an XML file of approximately 8 GB. Its structure is (simplified) similar to the following:
<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ....and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ....
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
            ..... hundreds of thousands of items, some are quite large.
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
                ...hundreds of thousands of RecordType2 elements, slightly different from RecordItems in RecordType1
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>
I need to extract some of the sub-elements in RecordType1 and RecordType2 elements. There are conditions to determine which record items need to be processed and which fields need to be extracted. The individual RecordItems do not exceed 120k (some have extensive text data, which I do not need).
Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name types and name components to pick.
from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    # The first event is the start of the root element (TopLevelElement).
    event, root = next(context)
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record is defined elsewhere (not shown here).
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            # Intended to free the elements parsed so far.
            root.clear()
    return all_records
I have experimented with the number of records: the code processes 100k RecordItems (only Type1; it just takes too long to get to Type2) in approximately one minute. Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError in ElementTree.py. So I am guessing that no memory is released despite the root.clear() statement.
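That guess can be checked on a tiny stand-in document. The sketch below uses made-up tag names (Top, Records, Type1, Item) rather than the real file, and it suggests that root.clear() only detaches the root's direct children: a nested container that is still open keeps every item that was appended to it.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature document: five items nested two levels below the root.
xml_bytes = b"<Top><Records><Type1>" + b"<Item/>" * 5 + b"</Type1></Records></Top>"

context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
event, root = next(context)          # start of <Top>
container = None
for event, elem in context:
    if event == "start" and elem.tag == "Type1":
        container = elem             # keep a handle on the nested container
    elif event == "end" and elem.tag == "Item":
        root.clear()                 # same call as in get_all_records

print(len(root))       # 0: the root no longer lists any children
print(len(container))  # 5: every item is still attached to <Type1>
```

If the same holds for the real file, the RecordItems stay reachable through their RecordType1 parent, which would be consistent with memory growing until the MemoryError.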
An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that. From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task. I am new to XML and to the respective Python libraries, so a detailed explanation would be much appreciated!
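One way the "read one, process, discard" idea could look is sketched below. It is only a sketch under stated assumptions: the helper name iter_record_items and its item_tag parameter are hypothetical, the sample document is a made-up miniature of the structure above, and the 'action' attribute check and get_record call from the original function are left to the caller. The generator tracks the currently open elements so that each finished RecordItem can be emptied and detached from its direct parent, whatever the nesting depth.

```python
import io
import xml.etree.ElementTree as ET

def iter_record_items(source, record_category, item_tag="RecordItem"):
    """Yield each item_tag element whose parent is record_category.

    Hypothetical helper: the caller must extract what it needs from the
    yielded element before asking for the next one, because the element
    is emptied and detached as soon as the generator resumes.
    """
    path = []  # elements that are currently open, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem)
            continue
        path.pop()  # elem has just been closed
        parent = path[-1] if path else None
        if elem.tag == item_tag and parent is not None:
            if parent.tag == record_category:
                yield elem
            elem.clear()         # drop the item's own subtree
            parent.remove(elem)  # drop the parent's reference to the item

# Made-up miniature of the structure in the question.
sample = b"""<TopLevelElement>
  <SomeElementList><Element>zzz</Element></SomeElementList>
  <Records>
    <RecordType1>
      <RecordItem id="aaaa"><SomeData>x</SomeData></RecordItem>
      <RecordItem id="bbbb"><SomeData>y</SomeData></RecordItem>
    </RecordType1>
    <RecordType2>
      <RecordItem id="cccc"/>
    </RecordType2>
  </Records>
</TopLevelElement>"""

for item in iter_record_items(io.BytesIO(sample), "RecordType1"):
    print(item.attrib["id"])
```

Items of the other record type are still parsed, but they are cleared and detached without being yielded, so they should not accumulate either. SomeElementList is left alone here on the assumption that a few thousand small rows are not a memory concern. Results should also be written out or reduced as they arrive, since collecting everything in one list keeps that data in memory regardless of how the tree is cleaned up.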