
Possible Duplicate:
Looping over a large XML file

What is a better way to parse large XML data (essentially a collection of XML records) in Java and Java-based frameworks? We get the data from a web service call, and it runs to a few MB (typically 25MB+). The data corresponds to an unmarshalled list of objects, and my objective is to build that list of objects from the XML.

I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects.

What are the other recommended approaches?

user1226058
  • Have you used a profiler? Is the problem in your code or in the XML library you are using? SAX is quite lightweight. Try, however, the Woodstox StAX parser if you feel compelled to try something different. – bmargulies May 09 '12 at 20:04

4 Answers


Try pull parsing instead, i.e. StAX. First search hit comparing the approaches: http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html

Have you profiled and seen where the bottlenecks are?

StAX is built into Java (since Java 6), but some recommend the Woodstox StAX implementation for even better performance. I have not tried it myself, though: http://woodstox.codehaus.org/
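For illustration, a minimal pull-parsing sketch using the StAX API shipped with the JDK. The `<item>`/`<name>` element names and the `Item` class are made-up placeholders, not taken from the question; adapt them to the actual payload:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StaxExample {

    // Hypothetical value object; substitute your own class.
    static class Item {
        String name;
    }

    static List<Item> parse(InputStream in) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<Item> items = new ArrayList<>();
        Item current = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("item".equals(reader.getLocalName())) {
                        current = new Item();
                    } else if (current != null && "name".equals(reader.getLocalName())) {
                        // Reads the text content and advances to the matching END_ELEMENT.
                        current.name = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    items.add(current);
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }
}
```

The point of pull parsing is that you drive the cursor yourself and only keep the object currently being built in memory, rather than receiving callbacks (SAX) or materialising the whole tree (DOM).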

Mattias Isegran Bergander
  • That isn't going to solve the problem on its own. You might find Woodstox is 10% faster but this isn't a 10% problem. – Michael Kay May 10 '12 at 08:37

I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects. What are the other recommended approaches?

There are only the following options:

  • DOM
  • SAX
  • StAX

SAX is the fastest of the three (see any SAX vs DOM vs StAX comparison), so if you switch to a different style I don't think you'll get any benefit, unless you are doing something wrong now.
Of course there are also marshalling/unmarshalling frameworks such as JAXB, but IMO (I haven't done any measurements) they could be slower, since they add an extra layer of abstraction on top of the XML processing.
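For completeness, the DOM option from the list above looks roughly like the sketch below (the `<item>` element name is a placeholder). Note that it builds the entire tree in memory, which is exactly why it is a poor fit for 25MB+ payloads:

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomExample {
    static void parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DOM materialises the whole document in memory before you can touch it.
        Document doc = builder.parse(in);
        NodeList items = doc.getElementsByTagName("item"); // hypothetical element name
        System.out.println("Parsed " + items.getLength() + " item elements");
    }
}
```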

Cratylus

SAX doesn't provide random access to the structure of the XML file, which means it offers a relatively fast and efficient method of parsing. Because the SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, which often makes it the first choice for dealing with large files.
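To make that streaming model concrete, here is a bare-bones SAX handler sketch; the `<item>`/`<name>` element names and the `Item` class are illustrative placeholders, not from the question:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {

    // Hypothetical value object for illustration.
    static class Item {
        String name;
    }

    static List<Item> parse(InputStream in) throws Exception {
        List<Item> items = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private Item current;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("item".equals(qName)) {
                    current = new Item();
                }
                text.setLength(0); // collect character data per element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && "name".equals(qName)) {
                    current.name = text.toString();
                } else if ("item".equals(qName)) {
                    items.add(current); // only one object is held at a time
                    current = null;
                }
            }
        });
        return items;
    }
}
```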

Paul Vargas

Parsing 25MB of XML should not take 45 seconds. There is something else going on; perhaps most of the time is spent waiting for an external DTD to be fetched from the web, I don't know. Before changing your approach, you need to understand where the costs are coming from and therefore which part of the system will benefit from changes.

However, if you really do want to convert the XML into Java objects (not the application architecture I would choose, but never mind), then JAXB sounds a good bet. I haven't used JAXB much since I prefer to stick with XML-oriented languages like XSLT and XQuery, but when I did try JAXB I found it pretty fast. Of course it uses a SAX or StAX parser underneath.
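A rough sketch of that JAXB route, using the `javax.xml.bind` API bundled with the JDK of that era. The `Items`/`Item` classes and element names are assumptions about the payload, not taken from the question:

```java
import java.io.InputStream;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbExample {

    // Hypothetical mapping; the element names must match what the web service actually returns.
    @XmlRootElement(name = "items")
    @XmlAccessorType(XmlAccessType.FIELD)
    static class Items {
        @XmlElement(name = "item")
        List<Item> item;
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    static class Item {
        String name;
    }

    static List<Item> unmarshal(InputStream in) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Items.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        // JAXB uses a SAX/StAX parser underneath and binds elements to the annotated classes.
        Items wrapper = (Items) unmarshaller.unmarshal(in);
        return wrapper.item;
    }
}
```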

Michael Kay