I have seen many questions and answers about reading large XML files, but none of them seem to achieve true multi-core performance for a Python process parsing XML. I have just started using ray, an excellent framework for managing multiprocessing in Python, and I was curious whether I could use it to parse large XML files. Currently I am using ElementTree.iterparse, which was intuitive to implement but can take upwards of an hour to read a large file in its serialized format. I'm not sure where to begin in order to maximize disk bandwidth, or what software already exists to help with this.
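For reference, my current serial approach looks roughly like this (the `record` tag, the `process` function, and the file name are placeholders for my actual data):

```python
import xml.etree.ElementTree as ET

def process(elem):
    # Placeholder for the real per-element work.
    return elem.attrib

def parse_serial(path):
    results = []
    # iterparse streams the file instead of loading it all into memory.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":  # stand-in for the actual child tag
            results.append(process(elem))
            elem.clear()  # release the finished subtree to bound memory
    return results

results = parse_serial("big.xml")  # hypothetical file name
```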
Note that in my case there are no dependencies in the data: each child element of the root element of the XML file is an independent entity.
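To illustrate, what I have in mind is something along these lines: the streaming parse still happens in one process, but batches of serialized elements are shipped to ray workers for the per-element work (again, `record` and the attribute extraction are stand-ins for my actual data). I don't know whether this is the right pattern, or whether the single-threaded reader will remain the bottleneck:

```python
import ray
import xml.etree.ElementTree as ET

ray.init()

@ray.remote
def process_chunk(xml_strings):
    # Re-parse each serialized element inside the worker process.
    return [ET.fromstring(s).attrib for s in xml_strings]

def parse_parallel(path, chunk_size=10_000):
    futures, chunk = [], []
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":  # stand-in for the actual child tag
            chunk.append(ET.tostring(elem))  # serialize so it can be shipped
            elem.clear()
            if len(chunk) == chunk_size:
                futures.append(process_chunk.remote(chunk))
                chunk = []
    if chunk:
        futures.append(process_chunk.remote(chunk))
    # Gather and flatten the per-chunk results from all workers.
    return [r for f in ray.get(futures) for r in f]
```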