0

I need to parse a large (>800MB) XML file from Jython. The XML is not deeply nested, containing about a million relevant elements. I need to convert these elements into real objects.

I've used nu.xom.* successfully before, but now that I've switched from Java to Jython, the library fails with the following message:

The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.

I have not found a way to fix this, so I probably have to look for another XML library. It could be either a Java library or Jython-compatible Python, and it should be efficient. A Pythonic API would be great; nu.xom.* is simple but not very Pythonic. Do you have any suggestions?

clstaudt

4 Answers

4

SAX is the best way to parse large documents, because it streams parse events instead of building the whole document tree in memory.

It sounds like you're hitting the parser's default entity expansion limit. See this note:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4843787

You need to set the Java system property `entityExpansionLimit` to change the default.

(added) see also the answer to this question.
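As a minimal sketch of the suggestion above: under Jython the property can be set from the script itself, as long as this happens before any XML parser is created. The helper name `raise_entity_expansion_limit` and the value `10000000` are hypothetical choices for illustration, not values taken from the bug report. The `java.lang` import only succeeds under Jython, so the sketch falls back to doing nothing elsewhere.

```python
try:
    # Only importable when running under Jython (on a JVM).
    from java.lang import System
except ImportError:
    System = None


def raise_entity_expansion_limit(limit):
    """Set the JVM-wide entity expansion limit.

    Returns True if the property was set, False when not running on a JVM.
    Call this before the first XML parser is constructed.
    """
    if System is None:
        return False
    # Property name as given in the answer; the value is an assumption.
    System.setProperty("entityExpansionLimit", str(limit))
    return True


raise_entity_expansion_limit(10000000)
```

The same property can also be passed as a `-D` option when the JVM is started, which avoids any dependence on the order in which parsers are created.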

Community
Steven D. Majewski
3

Try using a SAX parser; it is well suited to streaming large XML files.
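A minimal sketch of what this could look like with the standard `xml.sax` module, turning each matching element into an object as it streams past. The element name `item`, the `Item` class, and the `parse_items` helper are hypothetical and stand in for whatever the real document contains. Under Jython, `xml.sax` is backed by the Java parser, so the entity expansion limit still applies there.

```python
import xml.sax


class Item(object):
    """Hypothetical target object built from one XML element."""

    def __init__(self, attrs, text):
        self.attrs = attrs
        self.text = text


class ItemHandler(xml.sax.ContentHandler):
    """Builds one Item per matching element and hands it to a callback."""

    def __init__(self, tag, callback):
        xml.sax.ContentHandler.__init__(self)
        self.tag = tag
        self.callback = callback
        self._attrs = None   # attributes of the element being read, if any
        self._chunks = []    # text pieces collected inside that element

    def startElement(self, name, attrs):
        if name == self.tag:
            self._attrs = dict(attrs.items())
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._attrs is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self.tag and self._attrs is not None:
            text = "".join(self._chunks).strip()
            self.callback(Item(self._attrs, text))
            self._attrs = None


def parse_items(source, tag, callback):
    """Stream `source` (a filename or file object) through the handler."""
    xml.sax.parse(source, ItemHandler(tag, callback))
```

Because the handler keeps only the element currently being read, memory use stays roughly constant regardless of file size; what the callback does with each object decides the rest.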

DKIT
  • I have tried `xml.sax`, but now I am running into the following error: `xml.sax._exceptions.SAXParseException: :1:1: The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.` – clstaudt Feb 23 '11 at 21:10
  • @clstaudt: Why accept an answer that gives you such an error message? – John Machin Feb 24 '11 at 21:26
  • Because the suggestion itself is reasonable, and the error message is a separate issue. Of course, there are other reasonable suggestions now, and there is probably not a single answer to my question. – clstaudt Feb 25 '11 at 10:35
3

Does Jython support `xml.etree.ElementTree`? If so, use the `iterparse` function to keep your memory footprint down. Read this and use `elem.clear()` as described.
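A minimal sketch of the `iterparse` plus `elem.clear()` pattern, assuming `xml.etree.ElementTree` is available in your runtime. The `iter_elements` helper and the element name `item` are hypothetical; the tuples it yields stand in for whatever real objects you need to build.

```python
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield (attributes, text) for each `tag` element in `source`.

    `source` is a filename or a file object opened in binary mode.
    """
    # "end" events fire once an element and its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield dict(elem.attrib), (elem.text or "").strip()
            # Drop the element's contents once the caller has its copy.
            elem.clear()
```

Note that the emptied elements stay attached to their parent, so for a very large number of elements it may also be worth clearing the parent periodically.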

John Machin
  • It appears Jython does not support etree. There is a project [here](http://code.google.com/p/jython-elementtree/) to address this, but I'm not sure how current it is. – scorpiodawg Apr 11 '12 at 23:30
0

There is the lxml Python library, which can parse large files without loading all the data into memory, but I don't know if it is Jython-compatible.

Valentin Kantor