My requirement is to compress an XML file into a binary format, transmit it, and decompress it (lightning fast) before I start parsing it.

There are quite a few binary XML protocols and tools available. I found EXI (Efficient XML Interchange) better than the others. I tried its open source implementation, EXIficient, and found it good.

I have heard about Google Protocol Buffers and Facebook's Thrift. Can anyone tell me whether these two can do the job I am looking for?

Or just let me know if there is anything better than EXI that I should look at.

Also, there is a good XML parser, VTD-XML (I haven't tried it myself, just googled it and read some articles), that reportedly achieves better parsing performance than DOM, SAX, and StAX.

I want the best of both worlds: best compression plus best parsing performance. Any suggestions?

One more thing regarding EXI: how can EXI claim to be fast at parsing a decoded XML file, given that it is still being parsed by DOM, SAX, or StAX? I would have believed this if there were a separate binary parser for reading the decoded version. Correct me if I am wrong.

Also, is there a good open source C++ implementation of the EXI format? EXIficient provides a Java version, but I have not been able to find an open source C++ implementation.

There is one by AgileDelta, but that's commercial.

skaffman
Nadeem

3 Answers

3

You mention protocol buffers (protobuf); this is a binary format, but it has no direct relationship to XML. In particular, no member names (element names / attribute names / namespaces) are encoded; it is just the data (with numeric markers for identifiers).

As such, you cannot reconstruct arbitrary XML from a protobuf stream unless you already know how to map "field 3" etc.
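To make the "numeric markers" point concrete, here is a minimal sketch of how the protobuf wire format encodes a field, written by hand rather than with a protobuf library. The field numbers and values are made up for illustration. Each field is prefixed by a key computed as `(field_number << 3) | wire_type`, so the stream carries the number 3 but never a name such as "quantity".

```python
WIRE_VARINT = 0            # wire type for integers
WIRE_LENGTH_DELIMITED = 2  # wire type for strings, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Key (field number + wire type) followed by the varint value."""
    key = (field_number << 3) | WIRE_VARINT
    return encode_varint(key) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Key, then payload length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | WIRE_LENGTH_DELIMITED
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Hypothetical message: field 3 is an integer, field 1 is a string.
message = encode_int_field(3, 150) + encode_string_field(1, "Nadeem")
print(message.hex())
```

A receiver that does not know what "field 3" and "field 1" mean can recover the values but not any element or attribute names, which is why arbitrary XML cannot be rebuilt from the bytes alone.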

However! If you have an object model that works with both XML and protobuf, the transform is trivial: deserialize with one, serialize with the other. How well this works depends on the implementation; for example, it is trivial with protobuf-net and is actually how I do the codegen (load the binary; write as XML; run the XML through an XSLT layer to emit code).

If you actually just want to transfer object data (and XML is just a proposed implementation detail), then I thoroughly recommend protobuf; platform independent, a wide range of implementations, version-tolerant, very small output, and very fast processing at both read and write.

Marc Gravell
  • Great, that answers my questions about protobuf, and I believe Thrift would be no different. Thanks a lot for your answer. – Nadeem May 04 '11 at 18:15
  • Now I am looking for someone to clear up my confusion regarding EXI (Efficient XML Interchange). – Nadeem May 04 '11 at 18:18
3

Nadeem,

These are very good questions. You might be new to the domain, but the same questions are frequently asked by XML veterans. I'll try to address each of them.

> I heard about google protocol buffers and facebook's thrift, can any one tell me if these two can do the job i am looking for?

As mentioned by Marc, Protocol Buffers and Thrift are binary data formats, but they are not XML formats designed to transport XML data. E.g., they have no support for XML concepts like namespaces, attributes, etc., so mapping between XML and these binary formats would require a fair bit of work on your part.

> OR just let me know if there is anything better then EXI i should look for.

EXI is likely your best bet. The W3C completed a pretty thorough analysis of XML format implementations and found the EXI implementation (Efficient XML) consistently achieved the best compactness and was one of the fastest. They also found it consistently achieved better compactness than GZIP compression and even packed binary formats like ASN.1 PER (see W3C EXI Evaluation). None of the other XML formats were able to do that. In the tests I've seen comparing EXI with Protocol Buffers, EXI was at least 2-4 times smaller.

> I want best of both worlds, best compression + best parsing performance, any suggestions??

If it is an option, you might want to consider the commercial products. The W3C EXI tests mentioned above used Efficient XML, which is much faster than EXIficient (sometimes >10 times faster parsing and >20 times faster serializing). Your mileage may vary, so you should test it yourself if it is an option.

> One more thing regarding EXI, how can EXI claim to be fast at parsing a decoded XML file?

EXI can be smaller and faster to parse than XML because it can be streamed directly to and from memory via the standard XML APIs without ever producing the data in an intermediate XML text format. So, instead of serializing your data as XML via a standard API, compressing the XML, sending the compressed XML, decompressing the XML on the other end, and then parsing it through one of the XML APIs, you can serialize your data directly as EXI via a standard XML API, send the EXI, then parse the EXI directly through one of the XML APIs on the other side. This is a fundamental difference between compression and EXI. EXI is not compression per se; it is a more efficient XML format that can be streamed directly to and from your application.
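For comparison, the conventional compress-then-parse path described above might look like the following sketch. It uses only the Python standard library (`gzip` and `xml.sax`). The element names and the `NameRecorder` handler are made up for illustration. It does not show EXI itself, since no EXI codec is assumed to be available here.

```python
import gzip
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class NameRecorder(xml.sax.ContentHandler):
    """Hypothetical application handler: records element names as they arrive."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def sender_side() -> bytes:
    # Step 1: serialize application data as XML text via a standard API.
    buf = io.StringIO()
    gen = XMLGenerator(buf, encoding="utf-8")
    gen.startDocument()
    gen.startElement("order", {"id": "42"})
    gen.startElement("item", {})
    gen.characters("widget")
    gen.endElement("item")
    gen.endElement("order")
    gen.endDocument()
    # Step 2: compress the XML text as a separate pass.
    return gzip.compress(buf.getvalue().encode("utf-8"))


def receiver_side(payload: bytes) -> list:
    # Step 3: decompress back into XML text.
    xml_bytes = gzip.decompress(payload)
    # Step 4: parse the text again to get events for the application.
    handler = NameRecorder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names


print(receiver_side(sender_side()))
```

With an EXI library, the idea is that steps 1 and 2 collapse into one (the event handler writes EXI bytes directly) and steps 3 and 4 collapse into one (the decoder emits the same events directly), so the XML text in the middle never exists.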

Hope this helps!

  • BTW: The [Efficient XML FAQ](http://www.agiledelta.com/efx-faq.pdf) might also be helpful to you. – John Schneider May 14 '11 at 22:21
  • John, Thanks a lot for your detailed response. Really appreciate it. – Nadeem May 27 '11 at 15:01
  • John, after all my work related to binary XML (EXI being a major part of it), I want to publish my work in a well-known journal. Since I haven't invented anything new, can you please point me to the most relevant research journals related to XML where I can submit my article? Your response would be highly appreciated. – Nadeem Jul 26 '11 at 06:38
0

Compression is unified with the grammar system in the EXI format. The decoder API generally gives you a sequence of events, such as SAX events, when you let a decoder process an EXI stream. However, decoders do not internally convert EXI back into XML text to feed into another parser. Instead, the decoder does all the convoluted decompression/scanning work itself to yield an API event sequence such as SAX. Because EXI and XML are compatible at the event level, it is fairly straightforward to write out XML text given an event sequence.
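The event-level idea can be illustrated with a toy sketch. This is not the EXI format: the token stream, opcodes, and `NAMES` table below are invented for illustration, with the table loosely standing in for a grammar shared by both sides. The point is only that a decoder can drive a SAX-style handler directly, and that XML text appears only if the handler chosen happens to be an XML writer.

```python
import io
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator

# Hypothetical opcodes and name table for a made-up compact stream (not EXI).
START_ELEMENT, CHARACTERS, END_ELEMENT = 0, 1, 2
NAMES = {0: "order", 1: "item"}


def decode_to_events(tokens, handler):
    """Walk the compact tokens and fire SAX events; no XML text is produced here."""
    handler.startDocument()
    open_elements = []
    for opcode, operand in tokens:
        if opcode == START_ELEMENT:
            name = NAMES[operand]
            open_elements.append(name)
            handler.startElement(name, {})
        elif opcode == CHARACTERS:
            handler.characters(operand)
        elif opcode == END_ELEMENT:
            handler.endElement(open_elements.pop())
        else:
            raise ValueError(f"unknown opcode {opcode}")
    handler.endDocument()


class TextCollector(ContentHandler):
    """Application handler that consumes events directly."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


tokens = [
    (START_ELEMENT, 0),
    (START_ELEMENT, 1),
    (CHARACTERS, "widget"),
    (END_ELEMENT, None),
    (END_ELEMENT, None),
]

# Path A: the application receives events with no XML text in between.
collector = TextCollector()
decode_to_events(tokens, collector)

# Path B: the same events fed to an XML writer produce XML text, if wanted.
out = io.StringIO()
decode_to_events(tokens, XMLGenerator(out, encoding="utf-8"))
xml_text = out.getvalue()
print(xml_text)
```

Both paths use the same decoder, and only the handler differs.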

takuki
  • Just to confirm my understanding: are you saying that the EXI APIs provide another decoding parser that parses the decoded version and emits SAX-like events? – Nadeem May 05 '11 at 14:45
  • There is no other decoding parser involved. The EXI decoder does all the convoluted processing (integrated decompression and reverse tokenization) within itself, and its output is events such as SAX events. – takuki May 05 '11 at 16:00