
I have seen many questions and answers about reading large XML files, but none of them seem to achieve true multi-core performance for a Python process parsing XML. I have just started using ray, an excellent framework for managing multiprocessing in Python, and I am curious whether I could use it to parse large XML files. Currently I am using `ElementTree.iterparse`, which was intuitive to implement but can take upwards of an hour to read in a large file from its serialized form. I'm not sure where to begin to maximize disk bandwidth, nor what software already exists to assist with this.
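For reference, my current serial approach looks roughly like the following sketch; the `record` tag name and the `handle` extraction are placeholders for my actual data:

```python
import xml.etree.ElementTree as ET

def handle(elem):
    # hypothetical per-element work; replace with real processing
    return dict(elem.attrib)

def parse_serial(path, tag="record"):  # "record" is a placeholder tag
    results = []
    # "end" events fire once each element has been fully parsed
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            results.append(handle(elem))
            elem.clear()  # drop the finished subtree so memory stays flat
    return results
```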

Note that in my case there are no dependencies in the data: each child element of the XML file's root element is an independent entity.
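To make the question concrete, here is a minimal sketch of the kind of thing I have in mind: a serial pass slices the file into per-element strings, and batches of those strings are parsed on ray workers. The `record` tag, the batch size, and the attribute extraction are all placeholder assumptions:

```python
import ray
import xml.etree.ElementTree as ET

ray.init()

@ray.remote
def parse_batch(fragments):
    # Each fragment is the serialized text of one top-level child, so it
    # parses independently of the rest of the file. Return plain data, not
    # Element objects, since results are pickled back to the driver.
    return [dict(ET.fromstring(frag).attrib) for frag in fragments]

def parse_parallel(path, tag="record", batch_size=10_000):  # placeholders
    futures, batch = [], []
    # A cheap serial pass slices the file into per-element strings; the
    # CPU-heavy parsing of each batch then runs on ray workers.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()
            if len(batch) >= batch_size:
                futures.append(parse_batch.remote(batch))
                batch = []
    if batch:
        futures.append(parse_batch.remote(batch))
    results = []
    for chunk in ray.get(futures):
        results.extend(chunk)
    return results
```

The obvious caveat is that the driver still reads and re-serializes every element serially, so a scheme like this only pays off when per-element parsing or processing dominates the raw I/O. Splitting the file by byte offsets instead would avoid the serial pass, but requires knowing where each child element starts.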

  • To clarify, you want to parallelize the reading of a single large XML file? – tdelaney Apr 29 '20 at 04:20
  • `lxml.etree.iterparse` may be faster, just a guess. How large are these files? Reading into `lxml` instead of `ElementTree` is reasonably efficient. – tdelaney Apr 29 '20 at 04:26
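Per the suggestion in the comment above, a minimal `lxml.etree.iterparse` variant of the serial loop might look like this; the `tag` filter and the sibling cleanup are standard lxml idioms, and `record` is again a placeholder:

```python
from lxml import etree

def parse_with_lxml(path, tag="record"):  # "record" is a placeholder tag
    results = []
    # the tag= filter means only matching elements trigger events
    for event, elem in etree.iterparse(path, events=("end",), tag=tag):
        results.append(dict(elem.attrib))  # hypothetical extraction
        elem.clear()
        # also free already-processed siblings still held by the root
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return results
```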

0 Answers