
I am trying to parse large XML files (around 200MB each) using BeautifulSoup in Python. Even with the lxml parser, BeautifulSoup takes a long time to parse each file (roughly five minutes), so I am looking to cache the soup so it can be re-loaded quickly in the future.
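For reference, a minimal sketch of the parsing step (the file name is a placeholder):

    from bs4 import BeautifulSoup

    # Parse the large XML file with the lxml parser.
    # Takes around 5 minutes on a ~200MB file.
    with open("large_file.xml", "rb") as f:
        soup = BeautifulSoup(f, "lxml")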

I would normally use pickle to dump variables for re-loading later, but here I am receiving recursion errors. I have tried increasing the recursion limit as suggested in Hitting Maximum Recursion Depth Using Python's Pickle / cPickle, initially to 10,000 and then to 100,000. Unfortunately, at the higher value this crashes Python, which is the known risk of raising the recursion limit.
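A minimal sketch of the pickling attempt (the cache file name is a placeholder):

    import sys
    import pickle

    sys.setrecursionlimit(10000)  # also tried 100000, which crashed Python

    # Attempt to cache the parsed soup to disk.
    with open("soup.pkl", "wb") as f:
        pickle.dump(soup, f)  # still hits the recursion limit at 10,000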

Is there an alternative way to dump the variable that would allow it to be quickly read back into Python?

kyrenia
  • What happens when you use json (although I feel it might be slower than pickle and you'll have to convert XML structures to JSON)? – UltraInstinct Jun 30 '16 at 20:25
  • I'm guessing your performance problem for lxml is due to the recursion as well. Can you try and simplify the data structure somehow, e.g. splitting it into parts? – MisterMiyagi Jun 30 '16 at 20:44

0 Answers