I am trying to split a large XML document (88,645 lines) into multiple XML files, one per <project> node. The structure of the large XML document is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<projects>
  <project>
    <projectNumber>738951</projectNumber>
    <projectType>CHANGE REQUEST</projectType>
    <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
  </project>    

My goal is to split the document to look something like this:

XML 1:

<?xml version="1.0" encoding="UTF-8"?>
<project>
   <projectNumber>738951</projectNumber>
   <projectType>CHANGE REQUEST</projectType>
   <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
</project>    

XML 2:

<?xml version="1.0" encoding="UTF-8"?>
<project>
   <projectNumber>738951</projectNumber>
   <projectType>CHANGE REQUEST</projectType>
   <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
</project>

and so on. However, rather than pasting the XML into the script as a string, I want Python to read the actual (large) XML file.

The following is my initial code, which embeds the XML as a string (again, I want Python to read the actual XML document instead):

import xml.etree.ElementTree as ET

xml = '''<projects>
  <project>
    <projectNumber>738951</projectNumber>
    <projectType>CHANGE REQUEST</projectType>
    <lineOfBusiness>COMMERCIAL</lineOfBusiness>
    ...
'''

root = ET.fromstring(xml)
counter = 1

for child in root:
    if child.tag == 'project':
        # Write the <project> element itself as the root of the output file,
        # rather than wrapping it in a second <project> element
        with open(f'out_{counter}.xml', 'w', encoding='utf-8') as f:
            ET.ElementTree(child).write(f, encoding='unicode')
        counter += 1
Daniel Haley
moran29

2 Answers

You can use XMLPullParser as a non-blocking parser and process each project branch separately:

import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(['start', 'end'])  # other available events: comment, pi, start-ns, end-ns

with open("Large.xml", 'r') as f_xml:
    for line in f_xml:
        parser.feed(line)

for event, elem in parser.read_events():
    if event == "end" and elem.tag == "project":
        for tag_elem in elem.iter():
            if tag_elem.tag == "projectNumber":
                print(tag_elem.text)
            if tag_elem.tag == "projectType":
                print(tag_elem.text) 
            if tag_elem.tag == "lineOfBusiness":
                print(tag_elem.text)
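
To actually write one file per project, the same event loop could serialise each completed element instead of printing selected fields. The following is a minimal sketch, not a definitive implementation: the function name split_projects and the parameters out_dir and chunk_size are hypothetical, the out_N.xml naming follows the question, and it assumes every <project> is a direct child of the root element.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_projects(source_path, out_dir, tag="project", chunk_size=64 * 1024):
    """Write each <tag> child of the root to its own file; return the file count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parser = ET.XMLPullParser(["start", "end"])
    root = None
    depth = 0
    count = 0

    with open(source_path, "rb") as f_xml:
        # Feed the file in chunks and drain the queued events after each chunk.
        while chunk := f_xml.read(chunk_size):
            parser.feed(chunk)
            for event, elem in parser.read_events():
                if event == "start":
                    depth += 1
                    if root is None:
                        root = elem  # the first element opened is the root
                    continue
                depth -= 1
                # depth == 1 after an end event means a direct child of the root.
                if depth == 1 and elem.tag == tag:
                    count += 1
                    elem.tail = None  # drop whitespace that followed </project>
                    ET.ElementTree(elem).write(
                        str(out_dir / f"out_{count}.xml"),
                        encoding="UTF-8",
                        xml_declaration=True,
                    )
                    root.remove(elem)  # release the finished subtree
    parser.close()
    return count
```

Calling split_projects("Large.xml", "out") would then create out/out_1.xml, out/out_2.xml, and so on. Because each element is serialised whole, no child tag has to be named explicitly.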
Hermann12
  • Thank you! Since I have over 1,000 child nodes under each node, is there a way to avoid naming each child node, as you do in the last 8 lines of your code? – moran29 Feb 22 '23 at 16:37
  • There are several ways to split and parse your file, but without more information about the structure we can't give more specific advice. If the project structure is flat, a pandas DataFrame might be an option. For the splitting itself, my answer works, and you can feed the file project by project. – Hermann12 Feb 22 '23 at 18:36
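
If the project structure is indeed flat, the children can be collected without naming any tag by iterating over the element directly. This is only a sketch under that assumption: project_to_dict is a hypothetical helper, and repeated child tags would overwrite one another in the resulting dictionary.

```python
import xml.etree.ElementTree as ET


def project_to_dict(elem):
    """Map each direct child's tag to its stripped text (missing text becomes '')."""
    return {child.tag: (child.text or "").strip() for child in elem}


# Small stand-in for one <project> element from the large file.
sample = ET.fromstring(
    "<project>"
    "<projectNumber>738951</projectNumber>"
    "<projectType>CHANGE REQUEST</projectType>"
    "<lineOfBusiness>COMMERCIAL</lineOfBusiness>"
    "</project>"
)
record = project_to_dict(sample)
print(record)
```

A list of such dictionaries, one per project, is also a shape that pandas.DataFrame accepts directly, which fits the flat-structure suggestion above.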
In XSLT 2.0 or later, this can be done with xsl:result-document, which writes each project to its own file:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="2.0">
<xsl:template match="/">
  <xsl:for-each select="/*/project">
    <xsl:result-document href="proj{position()}.xml">
       <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>
</xsl:transform>
Michael Kay