I am currently parsing XML in Python 3.x. For files up to 300 MB the code below works without issues, but once the file size grows to 500 MB or into the GB range, I run into memory problems.

import xml.etree.ElementTree as etree
from collections import OrderedDict

# etree.parse() builds the whole document tree in memory before any row is read.
tree2 = etree.parse(xmlfile2)
root2 = tree2.getroot()
df_list2 = []
for child in root2:  # each <cmData> block
    # Only <managedObject> elements carry data; <header> is not needed here.
    for managed_object in child.findall('{raml20.xsd}managedObject'):
        xml_class_name2 = managed_object.get('class')
        xml_dist_name2 = managed_object.get('distName')
        for subchild in managed_object:  # each <p> parameter
            df_dict2 = OrderedDict()
            df_dict2['MOClass'] = xml_class_name2
            df_dict2['CellDN'] = xml_dist_name2
            df_dict2['Parameter'] = subchild.get('name')
            df_dict2['CurrentValue'] = subchild.text
            df_list2.append(df_dict2)

I have come across various articles explaining the use of 'iterparse', but I cannot work out how to use it to save the XML data in an ordered way. Below is the format of my XML:

<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="plan" scope="all" name="XML_Plan_update.xml">
    <header>
      <log dateTime="2018-12-31T16:13:28" action="created" appInfo="PlanExporter"/>
    </header>
    <managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
      <p name="defaultCarrier">10787</p>
      <p name="lCelwDN">MRBTS-137/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-4</p>
      <p name="maxCarrierPower">460</p>
    </managedObject>
    <managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-6770/WNBTS-1/WNCEL-26925" operation="update">
      <p name="defaultCarrier">10787</p>
      <p name="lCelwDN">MRBTS-6770/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-5</p>
      <p name="maxCarrierPower">460</p>
    </managedObject>
    <managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-806/WNBTS-1/WNCEL-22661" operation="update">
      <p name="defaultCarrier">10762</p>
      <p name="lCelwDN">MRBTS-806/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-9</p>
      <p name="maxCarrierPower">460</p>
    </managedObject>
  </cmData>
</raml>

I am currently using cElementTree or lxml to parse the XML and store the output generated by the for loop in an OrderedDict. Each dict is then appended to a list. I am looking for a way to use the iterparse method to parse the above XML into ordered dicts.

  • Since Python 3.3, there is no difference between ElementTree and cElementTree: https://docs.python.org/3/whatsnew/3.3.html#xml-etree-elementtree. – mzjn Mar 08 '19 at 06:14
  • Try this approach [Iterparse big XML, get all, even nested, Sequence Elements](https://stackoverflow.com/a/53883799/7414759). Change `tag=['entity']` to `tag=['managedObject']` and adjust `def __iter__(...` to your needs. – stovfl Mar 08 '19 at 08:28
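
For illustration, here is a minimal sketch of how the loop in the question might be rewritten with `iterparse` from the standard library's `xml.etree.ElementTree`. It is not the code from the linked answer. The names `iter_parameters` and `MANAGED_OBJECT` are hypothetical, and the sketch assumes every `managedObject` uses the `raml20.xsd` namespace shown in the sample. Each `managedObject` is turned into rows as soon as its closing tag is seen and is then detached from its parent, so the tree should stay small regardless of file size.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# Namespace-qualified tag, taken from xmlns="raml20.xsd" in the sample.
MANAGED_OBJECT = "{raml20.xsd}managedObject"


def iter_parameters(source):
    """Yield one OrderedDict per child of each <managedObject>, in document order.

    `source` may be a file name or a binary file object.
    """
    open_elements = []  # ancestors of the element currently being parsed
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue

        # "end" event: the element and all of its children are complete.
        open_elements.pop()
        if elem.tag != MANAGED_OBJECT:
            continue

        mo_class = elem.get("class")
        dist_name = elem.get("distName")
        for param in elem:
            yield OrderedDict([
                ("MOClass", mo_class),
                ("CellDN", dist_name),
                ("Parameter", param.get("name")),
                ("CurrentValue", param.text),
            ])

        # Detach the finished element from its parent so it can be freed.
        if open_elements:
            open_elements[-1].remove(elem)
```

Because `iter_parameters` is a generator, rows can be consumed one at a time, for example written straight to a CSV file with `csv.DictWriter`. Collecting every row with `list(iter_parameters(xmlfile2))` would still keep all rows in memory, although not the XML tree itself.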
