I have one large XML file that looks like the following. I would like to split this large XML file into multiple XML files/chunks based on a tag, with each output file holding 1000 PRVDR elements. What is the best way to do this in PySpark? So file01.xml will have the first 1000 PRVDR, file02.xml the next 1000 PRVDR, etc. I will be reading the file from S3 and writing the output files back to a different S3 location. If there are easier ways to do this at accelerated speed, please let me know. Examples would be awesome.
For this specific case, the PRVDR tag is the only split boundary, so there is no need to look at the elements nested within PRVDR. Each split simply spans from an opening PRVDR tag to its matching close: records 1-1000 go in the first file, and so on.
<PRVDR>
  <PRVDR_INFO>
    <!-- MANY OTHER ELEMENTS AND YES SOME ARE NESTED -->
  </PRVDR_INFO>
  <ENRLMTS>
    <XYZ>
      <!-- MANY OTHER ELEMENTS AND YES SOME ARE NESTED -->
    </XYZ>
  </ENRLMTS>
</PRVDR>
<PRVDR>
  <PRVDR_INFO>
    <!-- MANY OTHER ELEMENTS AND YES SOME ARE NESTED -->
  </PRVDR_INFO>
  <ENRLMTS>
    <XYZ>
      <!-- MANY OTHER ELEMENTS AND YES SOME ARE NESTED -->
    </XYZ>
  </ENRLMTS>
</PRVDR>
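Since the split is purely sequential (every 1000th closing PRVDR starts a new file), one option worth considering is a plain-Python streaming pass rather than Spark; if Spark is required, the Databricks spark-xml package can read the file with `rowTag` set to `PRVDR`, and the writer's `maxRecordsPerFile` option can cap each part file at 1000 records. Below is a minimal streaming sketch using `xml.etree.ElementTree.iterparse`. It assumes the real file wraps the PRVDR records in a single root element (the snippet above has none, which would not be well-formed XML), and it works on a local copy of the file; the S3 download/upload (e.g. via boto3) is omitted. The function name, paths, and the `<PRVDRS>` wrapper tag in the output are my own placeholders.

```python
import os
import xml.etree.ElementTree as ET


def _flush(chunk, output_dir, file_no):
    """Write one chunk of serialized PRVDR records as a well-formed file."""
    path = os.path.join(output_dir, f"file{file_no:02d}.xml")
    with open(path, "w") as f:
        f.write("<PRVDRS>\n")  # wrap each chunk in a root so it stays well-formed
        f.writelines(chunk)
        f.write("\n</PRVDRS>\n")
    return path


def split_prvdr_records(input_path, output_dir, chunk_size=1000):
    """Stream the source XML and write every `chunk_size` PRVDR elements
    to a numbered output file (file01.xml, file02.xml, ...).

    iterparse streams end-tag events, so the whole document is never
    loaded into memory at once.
    """
    os.makedirs(output_dir, exist_ok=True)
    chunk, file_no, written = [], 1, []
    for _event, elem in ET.iterparse(input_path, events=("end",)):
        if elem.tag == "PRVDR":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # free the children of the record just serialized
            if len(chunk) == chunk_size:
                written.append(_flush(chunk, output_dir, file_no))
                chunk, file_no = [], file_no + 1
    if chunk:  # remainder smaller than chunk_size
        written.append(_flush(chunk, output_dir, file_no))
    return written
```

Because the chunking is a single forward pass with no joins or shuffles, this tends to be I/O-bound; parallelizing it in Spark mostly helps if you have many such files rather than one.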