
I am starting work on a new piece of software that will end up needing some robust, expandable file I/O. There are a lot of formats out there: XML, JSON, INI, etc. However, there are always pluses and minuses, so I thought I would ask for some community input.

Here are some rough requirements:

  1. The format is a "standard"...I don't want to reinvent the wheel if I don't have to. It doesn't have to be a formal IEEE standard, but it should be something a new user could Google and get some information on, and it may have some support tools (editors) beyond vi. (Though the software users will generally be computer savvy and happy to use vi.)
  2. Easily integrates with C++. I don't want to have to pull along a 100 MB library and three different compilers to get it up and running.
  3. Supports tabular input (2d, n-dimensional)
  4. Supports POD types
  5. Can expand as more inputs are required, binds well to variables, etc.
  6. Parsing speed is not terribly important
  7. Ideally, as easy to write (reflect) as it is to read
  8. Works well on Windows and Linux
  9. Supports compositing (one file referencing another file to read, and so on.)
  10. Human Readable

In a perfect world, I would use a header-only library or some clean STL implementation, but I'm fine with leveraging Boost or some small external library if it works well.

So, what are your thoughts on various formats? Drawbacks? Advantages?

Edit

Options to consider? Anything else to add?

  • XML
  • YAML
  • SQLite
  • Google Protocol Buffers
  • Boost Serialization
  • INI
  • JSON
DigitalInBlue
  • What kind(s) of data do you want to represent? English-only, or i18n an issue? How important is it to have a compact representation? Need to interoperate with other languages? – vonbrand Feb 05 '13 at 03:46
  • For a text format: XML - *with* the appropriate libraries. INI fails at hierarchical/dimensional data, and JSON, while darn nice to use from JavaScript, lacks some of the nice features and tooling of XML. –  Feb 05 '13 at 03:56
  • C++ does not directly support structured files. You always need a piece of code to parse or generate something more complex than a pure binary or text file. XML/JSON are a fine choice and you can extend them by giving special meanings to things. You must realize, however, that when you have a lot of data, any format becomes poorly readable in a plain text editor. – Alexey Frunze Feb 05 '13 at 04:01
  • @vonbrand - Yes. English only. The data is going to be configuration data, but primarily numeric in nature. The most complex data would be tables on the order of 1024x1024x1024 elements. – DigitalInBlue Feb 05 '13 at 13:03
  • Some variant of INI is perhaps enough? Depends on how complex the data structure is... – vonbrand Feb 05 '13 at 13:21
  • The more I research this, the more I see the "array" requirement as the long pole in the tent. JSON & Boost property trees do not offer good support here for arbitrarily sized arrays. – DigitalInBlue Feb 05 '13 at 15:22
  • Try "cereal" : http://uscilab.github.io/cereal/ works fine – Erik Aronesty Mar 11 '15 at 14:37

4 Answers


There is one excellent format that meets all your criteria:

SQLite!

Please read the article on using SQLite as an application file format. Also, please watch the Google Tech Talk by D. Richard Hipp (SQLite's author) on this very topic.

Now, let's see how SQLite meets your requirements:

The format is a "standard"

SQLite has become the format of choice for most mobile environments, and for many desktop apps (Firefox, Thunderbird, Google Chrome, Adobe Reader, you name it).

Easily integrates with C++

SQLite has a standard C interface, which is just one source file and one header file (the amalgamation). There are C++ wrappers too.
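
For a sense of scale, a minimal sketch of the C API used from C++ (the file name settings.db and the config table are hypothetical):

    #include <sqlite3.h>
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("settings.db", &db) != SQLITE_OK) {
            std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            sqlite3_close(db);
            return 1;
        }
        char* err = nullptr;
        sqlite3_exec(db,
                     "CREATE TABLE IF NOT EXISTS config(key TEXT PRIMARY KEY, value TEXT);",
                     nullptr, nullptr, &err);
        if (err != nullptr) { std::fprintf(stderr, "%s\n", err); sqlite3_free(err); }
        sqlite3_close(db);
        return 0;
    }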

Supports tabular input (2d, n-dimensional)

An SQLite table is as tabular as you could possibly imagine. To represent, say, 3-dimensional data, create a table with columns x, y, z, value and store your data as a set of rows like this:

x1,y1,z1,value1
x2,y2,z2,value2
...
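
A sketch of what that looks like through the C API (the grid table name, the nx/ny/nz dimensions, and the data array are hypothetical; assumes an open sqlite3* db as above):

    // Store 3-D data as (x, y, z, value) rows via a prepared statement.
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS grid(x INT, y INT, z INT, value REAL);",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);  // one transaction for the bulk insert

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO grid(x, y, z, value) VALUES(?1, ?2, ?3, ?4);",
                       -1, &stmt, nullptr);
    for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
            for (int z = 0; z < nz; ++z) {
                sqlite3_bind_int(stmt, 1, x);
                sqlite3_bind_int(stmt, 2, y);
                sqlite3_bind_int(stmt, 3, z);
                sqlite3_bind_double(stmt, 4, data[x][y][z]);
                sqlite3_step(stmt);   // runs the INSERT
                sqlite3_reset(stmt);  // reuse the statement for the next row
            }
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

The BEGIN/COMMIT pair matters: without it, every INSERT becomes its own transaction, which is much slower for bulk loads.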

Supports POD types

I assume that by POD you mean Plain Old Data, i.e. BLOBs. SQLite lets you store BLOB fields as-is.

Can expand as more inputs are required, binds well to variables

This is where it really shines: you can add new tables and columns at any time without disturbing existing data, and prepared statements bind query results directly to your program's variables.

Parsing speed is not terribly important

But SQLite's speed is superb. In fact, parsing is basically transparent.

Ideally, as easy to write (reflect) as it is to read

Just use INSERT to write and SELECT to read - what could be easier?
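
As a sketch, continuing the hypothetical grid table from above, reading back is just a loop over sqlite3_step():

    sqlite3_stmt* q = nullptr;
    sqlite3_prepare_v2(db, "SELECT x, y, z, value FROM grid;", -1, &q, nullptr);
    while (sqlite3_step(q) == SQLITE_ROW) {
        int    x = sqlite3_column_int(q, 0);
        int    y = sqlite3_column_int(q, 1);
        int    z = sqlite3_column_int(q, 2);
        double v = sqlite3_column_double(q, 3);
        // ... bind (x, y, z, v) to your program's variables here ...
    }
    sqlite3_finalize(q);

Because rows arrive one at a time, the same loop also works when the data set is far larger than memory.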

Works well on Windows and Linux

You bet, and all other platforms as well.

Supports compositing (one file referencing another file to read)

You can ATTACH one database to another.
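
A sketch (the file name units.db and the alias units are hypothetical):

    // Tables in units.db become visible as units.<table> in the
    // same queries, alongside the main database.
    sqlite3_exec(db, "ATTACH DATABASE 'units.db' AS units;",
                 nullptr, nullptr, nullptr);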

Human Readable

Not in its binary form, but there are many excellent SQLite browsers/editors out there. I like SQLite Expert Personal on Windows and sqliteman on Linux. There is also an SQLite editor plugin for Firefox.


There are other advantages that SQLite gives you for free:

  • Data is indexable, which makes it very fast to search. You just cannot do this with XML, JSON, or any other text-only format.

  • Data can be edited partially, even when the amount of data is very large. You do not have to rewrite a few gigabytes just to edit one value.

  • SQLite is fully transactional: it guarantees that your data is consistent at all times. Even if your application (or the whole computer) crashes, your data will be automatically restored to the last known consistent state on the next attempt to connect to the database.

  • SQLite stores your data verbatim: you do not need to worry about escaping special characters in your data (including zero bytes embedded in your strings); simply always use prepared statements, and that's all it takes to make it transparent. Escaping can be a big and annoying problem when dealing with text data formats, XML in particular.

  • SQLite stores all strings in Unicode: UTF-8 (default) or UTF-16. In other words, you do not need to worry about text encodings or international support for your data format.

  • SQLite allows you to process data in small chunks (row by row, in fact), so it works well in low-memory conditions. This can be a problem for any text-based format, because such formats often need to load all the text into memory to parse it. Granted, there are a few efficient stream-based XML parsers out there, but in general any XML parser will be quite memory-greedy compared to SQLite.

mvp

Having worked quite a bit with both XML and json, here's my rather subjective opinion of both as extendable serialization formats:

  • The format is a "standard": Yes for both
  • Easily integrates with C++: Yes for both. In each case you'll probably wind up with some kind of library to handle it. On Linux, libxml2 is a standard, and libxml++ is a C++ wrapper for it; you should be able to get both of those from your distro's package manager. It will take some small effort to get those working on Windows. There appears to be some support in Boost for json, but I haven't used it; I've always dealt with json using libraries. Really, the library route is not very onerous for either.
  • Supports tabular input (2d, n-dimensional): Yes for both
  • Supports POD types: Yes for both
  • Can expand as more inputs are required: Yes for both - that's one big advantage to both of them.
  • Binds well to variables: If what you mean is some way inside the file itself to say "This piece of data must be automatically deserialized into this variable in my program", then no for both.
  • As easy to write (reflect) as it is to read: Depends on the library you use, but in my experience yes for both. (You can actually do a tolerable job of writing json using printf(); see the sketch after this list.)
  • Works well on Windows and Linux: Yes for both, and ditto Mac OS X for that matter.
  • Supports one file referencing another file to read: If you mean something akin to a C #include, then XML has some ability to do this (e.g. document entities), while json doesn't.
  • Human readable: Both are typically written in UTF-8, and permit line breaks and indentation, and thus can be human-readable. However, I've just been working with a 479 KB XML file that's all on one line, so I had to run it through a prettyprinter to make sense of it. json can also be pretty unreadable, but in my experience is often formatted better than XML.
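
To make the printf() aside above concrete, here is a sketch (the key names and values are made up); this is tolerable for small, trusted data, but a real library should handle escaping:

    #include <cstdio>

    int main() {
        // Hypothetical values; emits a small but valid json document.
        const char* name = "gain";
        double value = 2.5;
        std::printf("{\n  \"%s\": %g,\n  \"table\": [[1, 2], [3, 4]]\n}\n",
                    name, value);
        return 0;
    }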

When starting new projects, I generally prefer json; it's more compact and more human-readable. The main reason I might select XML over json would be if I were worried about receiving badly-formed documents, since XML supports automated document format validation, while you have to write your own validation code with json.
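
For the validation point, here is a sketch using libxml2's XML Schema support (the file names config.xml and config.xsd are hypothetical, and error checks are omitted for brevity):

    #include <libxml/parser.h>
    #include <libxml/xmlschemas.h>

    bool is_valid() {
        xmlDocPtr doc = xmlReadFile("config.xml", nullptr, 0);
        xmlSchemaParserCtxtPtr pc = xmlSchemaNewParserCtxt("config.xsd");
        xmlSchemaPtr schema = xmlSchemaParse(pc);
        xmlSchemaValidCtxtPtr vc = xmlSchemaNewValidCtxt(schema);
        bool ok = (xmlSchemaValidateDoc(vc, doc) == 0);  // 0 means the document is valid
        xmlSchemaFreeValidCtxt(vc);
        xmlSchemaFree(schema);
        xmlSchemaFreeParserCtxt(pc);
        xmlFreeDoc(doc);
        return ok;
    }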

Bob Murphy
  • That is good information. I have mainly used XML in the past, but found table input awkward. "Supports one file referencing another file to read: If you mean something akin to a C #include, then XML has some ability to do this (e.g. document entities), while json doesn't." That is good to know, surprising, and probably rules out JSON. – DigitalInBlue Feb 05 '13 at 13:07
  • Some benefits I find of using XML over JSON: Schemas/validators/infosets, XPath/XQuery, database (e.g. SQL Server) support, namespaces, attributes *and* child elements. And no, no, I'm not going to say XSLT 1.0 - that can go away forever :D JSON is nice for REST, Web-services, JavaScript-integration, and "small documents". However, this simplicity comes at a cost of standardized advanced features and tooling - not all "Enterprise" stuff has to be hard to use, even if many products based on XML are. –  Feb 05 '13 at 19:18
  • @pst: Correction about json readability made. – Bob Murphy Feb 05 '13 at 19:26

Check out Google Protocol Buffers. They handle most of your requirements.

From their documentation, the high-level steps are:

  1. Define message formats in a .proto file.
  2. Use the protocol buffer compiler.
  3. Use the C++ protocol buffer API to write and read messages.
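
As a sketch of step 3, assuming a hypothetical message Config { optional string name = 1; repeated double samples = 2; } that protoc has compiled into config.pb.h:

    #include <fstream>
    #include "config.pb.h"  // generated by: protoc --cpp_out=. config.proto

    int main() {
        Config cfg;                     // hypothetical generated class
        cfg.set_name("run42");
        cfg.add_samples(3.14);

        std::ofstream out("config.bin", std::ios::binary);
        cfg.SerializeToOstream(&out);   // write the binary message
        out.close();

        Config loaded;
        std::ifstream in("config.bin", std::ios::binary);
        loaded.ParseFromIstream(&in);   // read it back
        return 0;
    }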

David D
  • Do you have experience with Google proto buffers? My sense was that it was difficult to expand messages and maintain backward compatibility, but I have not implemented any solutions with them to know how easy they are to maintain for a large project with potentially large inputs (from dozens to hundreds of files composing one input set). – DigitalInBlue Feb 05 '13 at 13:10
  • Looks like proto buffers do not support n-dimensional arrays. [link](http://stackoverflow.com/questions/6825196/protocol-buffers-store-an-double-array-1d-2d-and-3d) – DigitalInBlue Feb 05 '13 at 13:41
  • 2
    @DigitalInBlue I do have some experience with google buffers. I used them in an R&D project to control a piece of hardware remotely. [Updating A Message Type, search for section with the same name,](https://developers.google.com/protocol-buffers/docs/proto) is pretty simple, but requires some though, I would consider not using the required keyword as often as you can get away with it. As for N-dimensional arrays, the link you provided spells out a pretty good way to emulate them. Why would you not want to do it as they suggested? – David D Feb 05 '13 at 15:37
  • Thanks for the info. No reason I wouldn't do as was suggested; I'm just trying to understand the ins and outs of the various standards before a decision is made. The "emulation" in proto buffers isn't ideal, but perhaps not a deal breaker. – DigitalInBlue Feb 05 '13 at 16:44

For my purposes, I think the way to go is XML.

  1. The format is a standard, but allows for modification and flexibility for the schema to change as the program requirements evolve.
  2. There are several library options. Some are larger (Xerces-C), some are smaller (ezxml), but there are many options, so we won't be locked into a single provider or a very specific solution.
  3. It can support tabular input (2d, n-dimensional). This requires more parsing work on "our" end and is likely the weakest point for XML (see the sketch after this list).
  4. Supports POD types: Absolutely.
  5. Can expand as more inputs are required, binds well to variables, etc. through schema modifications and parser modifications.
  6. Parsing speed is not terribly important, so processing a text file or files is not an issue.
  7. XML can be programmatically written just as easily as read.
  8. Works well on Windows and Linux or any other OS that supports C and text files.
  9. Supports compositing (one file referencing another file to read, and so on.)
  10. Human Readable with many text editors (Sublime, vi, etc.) supporting syntax highlighting out of the box. Many web browsers display the data well.
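
As a sketch of point 3, a 2d table could be encoded as <cell x="..." y="..." value="..."/> elements and read back with ezxml; all element and attribute names here are hypothetical:

    #include <cstdlib>
    #include "ezxml.h"

    void load_table(const char* path) {
        ezxml_t table = ezxml_parse_file(path);  // root element, assumed to be <table>
        for (ezxml_t cell = ezxml_child(table, "cell");
             cell != nullptr; cell = ezxml_next(cell)) {
            int    x = std::atoi(ezxml_attr(cell, "x"));
            int    y = std::atoi(ezxml_attr(cell, "y"));
            double v = std::atof(ezxml_attr(cell, "value"));
            // ... store (x, y, v) in the program's table structure ...
        }
        ezxml_free(table);
    }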

Thanks for all the great feedback! I think if we wanted a purely binary solution, Protocol Buffers or boost::serialization is likely the way that we would go.

DigitalInBlue
  • Though I did not downvote this, I'll comment: at SO, since your answer does not offer a solution not covered by other answers, it is kind of bad form to write what you did as an answer and accept it. Better to accept the best answer from somebody else (you can still change that, I think), unless none of them is any good... Also, unless your answer actually adds something useful (this does not), it's better to post your chosen solution as a comment. – hyde Feb 09 '13 at 20:24
  • 4
    I did downvote it. To write your own answer, effectively cribbing from other people's answers, and then accept it as the answer, is really bad form and it works against the incentive system here. – Bill Weinman Nov 08 '13 at 23:56