
Reading the large Stack Overflow XML dump file (Posts.xml, ~90 GB) with the following approach

from xml.etree.cElementTree import iterparse

for evt, elem in iterparse("Posts.xml", events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib

causes an OOM just from iterating over the XML elements (without retaining any of them), even on a machine with 128 GB of RAM.

Since I could not find anything about this in the documentation or in other Stack Overflow examples, could you help me figure out how to work around it?
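The behaviour is reproducible on a small synthetic file (the file name and layout below are my own stand-in for Posts.xml): `iterparse` still builds the full tree as it goes, so every parsed `row` stays attached to the root until something explicitly removes it.

```python
# Minimal sketch, using a tiny synthetic file with the same flat
# <posts><row .../>...</posts> layout as the real dump: after iterating,
# every <row> is still a child of the root element.
from xml.etree.ElementTree import iterparse

with open("small_posts.xml", "w") as f:
    f.write("<posts>" + "".join(f'<row Id="{i}"/>' for i in range(3)) + "</posts>")

context = iterparse("small_posts.xml", events=("start", "end"))
event, root = next(context)          # first event: 'start' of the root element
rows_seen = sum(1 for event, elem in context
                if event == "end" and elem.tag == "row")
print(rows_seen, len(root))          # prints: 3 3 -> all rows still hang off the root
```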

Celso França
  • Can you [edit] your question and put there sample input and what information do you want to get? – Andrej Kesely Jul 21 '23 at 17:11
  • 1
    Maybe you specify the `row` tag (using the `tags` arg) and using the `.clear()` method. See https://stackoverflow.com/a/68486945/317052 for an example. – Daniel Haley Jul 21 '23 at 17:15
  • 1
    Oops...the `tags` arg is only available in lxml I think. Maybe start with just using `elem.clear()`. – Daniel Haley Jul 21 '23 at 17:20
  • Thanks @DanielHaley. I've already included a `elem.clear()` without results. The only approach that seems to work is `del elem`. – Celso França Jul 21 '23 at 17:24
  • Any info on how to handle this OOM issue. – Celso França Jul 21 '23 at 18:39
  • Using a PowerShell script with XmlReader is the only way I know of parsing huge XML. It is taking too long to download the file. If you can post a small sample of the XML I will write code. – jdweng Jul 21 '23 at 20:23

1 Answer


Based on Daniel Haley's comments, you could try:

from lxml.etree import iterparse  # lxml's iterparse supports the `tag` filter

for evt, elem in iterparse("Posts.xml", events=('end',), tag="row"):
    user_fields = elem.attrib
    ...
    elem.clear()
    # elem.clear() alone is not enough with lxml: processed elements stay
    # attached to the root as siblings, so drop them too.
    while elem.getprevious() is not None:
        del elem.getparent()[0]
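If you prefer to stay on the standard library (no lxml dependency), a common alternative is to grab the root element from the `'start'` event and clear it after each row, so processed rows are released instead of piling up. This is a sketch assuming the same flat `<posts><row .../>...</posts>` layout, shown against a small synthetic file so it is self-contained:

```python
# Stdlib-only variant: capture the root from the 'start' event and clear it
# after handling each <row>, so already-processed rows are detached and freed.
from xml.etree.ElementTree import iterparse

with open("demo_posts.xml", "w") as f:
    f.write("<posts>" + "".join(f'<row Id="{i}"/>' for i in range(5)) + "</posts>")

context = iterparse("demo_posts.xml", events=("start", "end"))
event, root = next(context)                  # 'start' of the root element
ids = []
for event, elem in context:
    if event == "end" and elem.tag == "row":
        ids.append(elem.get("Id"))           # read the attributes first...
        root.clear()                         # ...then detach processed rows
print(ids, len(root))                        # prints: ['0', '1', '2', '3', '4'] 0
```

The key point is to read whatever you need from `elem.attrib` before calling `root.clear()`, since clearing removes the tree's references to the already-parsed children.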
Aldebaran