Which libxml2 API should I use for large files?

Question

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.

It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.

I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?

Note: All I need to do with the XML file is populate some C++ data structures with the data it contains, and these will be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API, the process takes about 100MB of memory for a 20MB XML file. The C++ data in memory for such a file is more like 5MB, so I could go from a file-to-memory ratio of about 1:5 to 4:1, which would already help a lot.

score 1 · Accepted Answer · answered Mar 21 '13 at 15:07

I follow this approach: if the processing is sparse (you need only an element here and there), xmlReader is better; if you need to process all elements, SAX is better. That said, opinion comes into play as to whether you want your code to pull data from the parser (xmlReader) or the parser to push events into your code (SAX).

Lucas


I have since used xmlReader and I am very satisfied with it. Easy to wrap in a simple helper for the data you have, and no SAX weirdness. – Martin Ba Sep 14 '17 at 07:27

score 1 · Answer 2 · answered Sep 14 '17 at 07:24

If you need to process large XML documents, size becomes the primary consideration. As you saw, DOM parsing turned a 20MB file into 100MB of memory; for much larger files that cost can become prohibitive, and SAX may be the only way to process them. On embedded or memory-constrained devices, SAX may be required even for small files.
If you want to start parsing before the file is complete, SAX is the way to go. If you are writing a browser, streaming XML, or need responsiveness, you will need SAX.
SAX is more of a pain, though. If you can get away with DOM parsing, that will usually lead to less and simpler code; for simple DOM queries you can avoid a state machine, for example. If you only care about a handful of fields in the document, you could even avoid querying the DOM directly and use XSLT instead.

XmlReader is what I ended up using. – Martin Ba Sep 14 '17 at 07:28
