
I'm working on a project that needs to parse a very large XML file (about 10 GB). Because processing takes a very long time (on the order of days), my code may exit partway through, so I want to save its state periodically and be able to restart from the last save point.

Is there a way to start (or restart) a SAX parser from somewhere other than the beginning of an XML file?

P.S.: I'm programming in Python, but solutions for Java and C++ are also acceptable.

mmohaveri

1 Answer


I'm not sure this answers your question directly, but I would take a different approach. 10 GB is not THAT much data, so you could implement two-phase parsing.

Phase 1 would be to split the file into smaller chunks based on some tag, so you end up with many smaller files. For example, if your original file is A.xml, you split it into A_0.xml, A_1.xml, etc.
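A minimal sketch of what Phase 1 might look like in Python, using the standard library's `xml.etree.ElementTree.iterparse` to stream the input. The function name `split_xml`, the `<chunk>` wrapper element, and the `records_per_chunk` parameter are hypothetical choices for illustration. It assumes the records to split on are direct children of the root element, are not nested inside each other, and do not use namespaces.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(src_path, out_dir, record_tag, records_per_chunk=1000):
    """Stream src_path and write groups of <record_tag> elements to chunk files.

    Assumes records are direct children of the root and carry no namespaces.
    Returns the list of chunk file paths in order (A_0.xml, A_1.xml, ...).
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    chunk_paths = []
    buffer = []

    def flush():
        # Wrap the buffered records in a hypothetical <chunk> root element.
        path = os.path.join(out_dir, f"{base}_{len(chunk_paths)}.xml")
        with open(path, "w", encoding="utf-8") as out:
            out.write("<chunk>\n")
            out.writelines(buffer)
            out.write("</chunk>\n")
        chunk_paths.append(path)
        buffer.clear()

    context = ET.iterparse(src_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            buffer.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # drop finished records so memory use stays flat
            if len(buffer) >= records_per_chunk:
                flush()
    if buffer:
        flush()  # write the final, possibly smaller, chunk
    return chunk_paths
```

Calling `split_xml("A.xml", "chunks", "item")` would then produce `chunks/A_0.xml`, `chunks/A_1.xml`, and so on, where `"item"` stands in for whatever tag delimits one record in your data.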

Phase 2 would do the real heavy lifting on each chunk, so you invoke it on A_0.xml, then on A_1.xml, etc. If your code exits partway through, you can restart from the chunk it was working on instead of from the beginning.
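One way to make Phase 2 restartable is to record each finished chunk in a small checkpoint file and skip those chunks on the next run. The sketch below is only an illustration: `process_chunks`, the JSON checkpoint format, and the `handle_chunk` callback are hypothetical names, and it assumes that reprocessing the one chunk that was interrupted is acceptable.

```python
import json
import os


def process_chunks(chunk_paths, checkpoint_path, handle_chunk):
    """Run handle_chunk on every chunk not yet listed in the checkpoint file."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = set(json.load(f))

    for path in chunk_paths:
        if path in done:
            continue  # finished in an earlier run
        handle_chunk(path)  # the real heavy lifting happens here
        done.add(path)
        # Write to a temporary file, then rename it over the checkpoint,
        # so a crash mid-write does not leave a truncated checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
        os.replace(tmp_path, checkpoint_path)
```

After a crash, calling `process_chunks` again with the same arguments picks up at the first chunk that is not in the checkpoint. Smaller chunks mean less repeated work after a failure, at the cost of more files.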

Rob Audenaerde