0

My application needs to store large amounts of XML-like hierarchical information with the following requirements:

  1. Fast to read
  2. Minimal memory consumption
  3. Typed data instead of merely text

Any suggestions for a binary format that fulfills these goals?

Tony the Pony
  • Does the storage really need to be binary or are you saying that because you think that binary is 'obviously' more efficient? XML stored as zipped data can be more efficient than many binary formats (such as standard serialized Java). – SteveD Sep 06 '09 at 21:06
  • That's quite an assertion... it might be smaller, but I very much doubt that it'd be faster. – skaffman Sep 06 '09 at 21:07
  • @Skaffman Does your comment refer to the question, or the previous comment? – KLE Sep 06 '09 at 21:09

6 Answers

1

Do other applications need to read the stored data, or just yours? Does it need to be a "standard" format?

Fast Infoset meets requirements (1) and (2), although because it's just a binary representation of the XML information model, it's just as untyped as XML. Might be good enough for your purposes, though, in the absence of anything else.

skaffman
1

There's too little detail in your requirements to give good suggestions. For example, are you free to pick your storage medium? Will it be a file system, a database, or something else?

What does "minimum memory consumption" mean? Are you running on a constrained platform? Must you share resources with other applications? Is a 1GB footprint small enough if your computer has 4GB of memory? Will your data sit in memory or only the parts you are working on?

If the platform was Java, I'd start with its standard serialization and then investigate custom serialization if I wasn't happy with the performance.

SteveD
1

If the format is negotiable, I'd suggest JSON, not XML. JSON is actually faster to load and write than standard XML.

More about JSON:

http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=060ca7c3-b03f-41aa-937b-c8cba5b7f986
http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=39842a17-781a-45c8-ade5-58286909226b

yoda
  • Firstly, JSON is not a substitute for XML, it can't represent structures of the same complexity. Secondly, that's quite a performance claim, one which I'd like to see backed up with evidence. – skaffman Sep 06 '09 at 21:20
  • I'd like to know more about the "structures of the same complexity" that JSON can't handle as well. – yoda Sep 06 '09 at 21:23
  • XML attributes, for one, XML namespaces for another. JSON is just a simple nested-key-value map. – skaffman Sep 06 '09 at 21:32
  • JSON is built for data structures (better than XML). Calling it "just a simple nested-key-value map" is like calling XML a "popular" way of writing semantic code: it doesn't make sense and isn't honest about its benefits. JSON doesn't have namespaces for now, that's true, because the community was divided when asked about implementing them. As for the XML attributes, if you could be more specific, I'd appreciate it. – yoda Sep 06 '09 at 21:44
  • +1 for suggesting something other than XML. Namespaces are overrated for data storage, and attributes don't add any real benefits in data complexity. Anything you can do with attributes you can do with a tag; attributes just make the markup more compact. – Jeremy Wall Sep 07 '09 at 00:41
1

You could also read the XML into an object graph and store as Google Protocol Buffers. These are designed to be very efficient.
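To make the idea concrete, here is a minimal Python sketch of "parse the XML into an object graph, then store it as typed binary". It is not the Protocol Buffers API or wire format; the `Node` class, the type tags, and the length-prefixed layout are all hypothetical, chosen only to show typed values surviving a binary round trip. It assumes integers fit in 64 bits and ignores XML attributes.

```python
import struct
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical type tags for this sketch only (not a real wire format).
TAG_INT, TAG_FLOAT, TAG_STR, TAG_NONE = 0, 1, 2, 0xFF


@dataclass
class Node:
    name: str
    value: object = None                      # int, float, str, or None
    children: list = field(default_factory=list)


def typed(text):
    """Guess a type for element text: int first, then float, else str."""
    if text is None or not text.strip():
        return None
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def from_xml(elem):
    """Build the object graph from an ElementTree element (attributes ignored)."""
    return Node(elem.tag, typed(elem.text), [from_xml(c) for c in elem])


def _pack_bytes(data):
    # 4-byte big-endian length prefix followed by the raw bytes.
    return struct.pack(">I", len(data)) + data


def encode(node):
    """Serialize a node as: name, type tag, value, child count, children."""
    out = [_pack_bytes(node.name.encode("utf-8"))]
    v = node.value
    if v is None:
        out.append(struct.pack(">B", TAG_NONE))
    elif isinstance(v, int):
        out.append(struct.pack(">Bq", TAG_INT, v))      # 64-bit signed
    elif isinstance(v, float):
        out.append(struct.pack(">Bd", TAG_FLOAT, v))    # IEEE 754 double
    else:
        out.append(struct.pack(">B", TAG_STR) + _pack_bytes(v.encode("utf-8")))
    out.append(struct.pack(">I", len(node.children)))
    out.extend(encode(c) for c in node.children)
    return b"".join(out)


def decode(buf, pos=0):
    """Inverse of encode; returns (node, position after the node)."""
    def read_bytes(p):
        (n,) = struct.unpack_from(">I", buf, p)
        return buf[p + 4:p + 4 + n], p + 4 + n

    raw, pos = read_bytes(pos)
    name = raw.decode("utf-8")
    tag = buf[pos]
    pos += 1
    if tag == TAG_NONE:
        value = None
    elif tag == TAG_INT:
        (value,) = struct.unpack_from(">q", buf, pos)
        pos += 8
    elif tag == TAG_FLOAT:
        (value,) = struct.unpack_from(">d", buf, pos)
        pos += 8
    else:
        raw, pos = read_bytes(pos)
        value = raw.decode("utf-8")
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    children = []
    for _ in range(count):
        child, pos = decode(buf, pos)
        children.append(child)
    return Node(name, value, children), pos
```

With a real Protocol Buffers setup, a schema file and generated classes would take the place of the hand-written `encode` and `decode`; the sketch only shows the shape of the approach.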

Fortyrunner
1

You don't specify whether XML is a format requirement; you only say the data needs to be hierarchical like XML.

Without more detail on the kind of data it's hard to give you very much advice. So here's a small list.

  • B-trees: there are a number of libraries supporting B-tree storage formats in multiple languages. They have fast lookups and are hierarchical in nature.
  • Protocol Buffers from Google: compact storage optimized for sending over the wire. Not necessarily optimized as a storage format, but they are typed and will probably do pretty well as one.
  • Zipped text formats: compact and, depending on the format chosen, typed and hierarchical in nature.
    • YAML (support for some complex typing, hierarchical, human readable)
    • JSON (less typing support, fast parsing, hierarchical, human readable)
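As a rough illustration of the zipped-text option from the list above, here is a minimal Python sketch that stores a nested structure as gzip-compressed JSON using only the standard library. The function names `save` and `load` are hypothetical. JSON's basic types (numbers, strings, booleans, null) survive the round trip, though how much the data shrinks depends entirely on how repetitive it is.

```python
import gzip
import json


def save(obj) -> bytes:
    """Serialize to compact JSON text, then gzip the UTF-8 bytes."""
    text = json.dumps(obj, separators=(",", ":"))
    return gzip.compress(text.encode("utf-8"))


def load(blob: bytes):
    """Inverse of save: gunzip, then parse the JSON text."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

For example, `load(save({"id": 7, "tags": ["a", "b"]}))` gives back an equal dictionary with the integer still an integer. The whole document has to be decompressed and parsed before any part of it can be read, which may matter for the "fast to read" requirement.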
Jeremy Wall
1

Wikipedia's explanation of the issue: http://en.wikipedia.org/wiki/Binary_XML

Supposedly the recommended organisation and its Java and .NET SDKs can be downloaded from: http://www.agiledelta.com/product_efx.html

XML is pure text but can be used to represent serialized objects. Let's presume your serializer is serializing your objects into XML.

You should not try to convert your objects into binary streams yourself, because you would have to tackle endianness (http://en.wikipedia.org/wiki/Endian) and data-representation issues. However, if you insist, you would need to use XDR (http://en.wikipedia.org/wiki/External_Data_Representation) for its data architecture neutrality.
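To show what the endianness concern looks like in practice, here is a minimal Python sketch that uses the standard `struct` module with an explicit big-endian format, which is the byte order XDR specifies. The helper names are hypothetical, and the sketch covers only three XDR-style types (32-bit int, double, and a length-prefixed string padded to a multiple of four bytes); it is not a complete XDR implementation.

```python
import struct


def pack_int(n: int) -> bytes:
    """32-bit signed integer, big-endian regardless of the host CPU."""
    return struct.pack(">i", n)


def pack_double(x: float) -> bytes:
    """64-bit IEEE 754 double, big-endian."""
    return struct.pack(">d", x)


def pack_string(s: str) -> bytes:
    """4-byte length, the bytes, then zero padding to a 4-byte boundary."""
    data = s.encode("utf-8")   # assumption: UTF-8 payload for this sketch
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def unpack_int(buf: bytes, pos: int = 0):
    (n,) = struct.unpack_from(">i", buf, pos)
    return n, pos + 4


def unpack_double(buf: bytes, pos: int = 0):
    (x,) = struct.unpack_from(">d", buf, pos)
    return x, pos + 8


def unpack_string(buf: bytes, pos: int = 0):
    (n,) = struct.unpack_from(">I", buf, pos)
    start = pos + 4
    s = buf[start:start + n].decode("utf-8")
    # Skip the payload plus its padding to land on the next 4-byte boundary.
    return s, start + n + (-n) % 4
```

Because the format strings start with `>`, the bytes written are the same on little-endian and big-endian machines, which is the neutrality the paragraph above is asking for.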

Otherwise, you should serialize your objects to XML using standard serializers and then convert the XML to binary/compact XML, because libraries and SDKs for that are readily available. To deserialize, decompact the binary XML first.

Blessed Geek