I am trying to split a large XML document (88,645 lines) into multiple XML files, one per <project> node. The structure of the large XML document is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<projects>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
My goal is to split the document to look something like this:
XML 1:
<?xml version="1.0" encoding="UTF-8"?>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
XML 2:
<?xml version="1.0" encoding="UTF-8"?>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
and so on. However, instead of pasting the XML into the script as a string, I want to feed it the actual (large) XML document.
The following is my initial code, which uses a hard-coded XML string (but again, I want Python to read the actual XML document):
import xml.etree.ElementTree as ET
xml = '''<projects>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
'''
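root = ET.fromstring(xml)
counter = 1
for child in list(root):
    # Only direct <project> children of the root are split out
    if child.tag == 'project':
        # Write the <project> element itself as the root of the new file;
        # wrapping it in another ET.Element('project') would nest it twice
        with open(f'out_{counter}.xml', 'w') as f:
            tree = ET.ElementTree(child)
            tree.write(f, encoding="unicode")
        counter += 1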
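For reference, here is a minimal sketch of the file-based version I have in mind. It assumes the document is well-formed and small enough to parse in one go with ET.parse. The function name split_projects, its out_dir parameter, and the out_N.xml naming are my own placeholders, not anything required by ElementTree.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_projects(source, out_dir="."):
    """Write each <project> child of the root element to its own XML file.

    `source` is the path to the large XML document (hypothetical name).
    Returns the list of file paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Parse the document from disk instead of from a string literal
    root = ET.parse(str(source)).getroot()

    written = []
    # findall("project") matches only direct <project> children of the root
    for counter, project in enumerate(root.findall("project"), start=1):
        target = out_dir / f"out_{counter}.xml"
        # Each <project> becomes the root of its own tree, with an XML declaration
        ET.ElementTree(project).write(
            str(target), encoding="UTF-8", xml_declaration=True
        )
        written.append(target)
    return written
```

Calling split_projects("projects.xml") (the file name is an assumption) would then produce out_1.xml, out_2.xml, and so on in the current directory, each with <project> as its root element.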