Considerations for binary seralizations (Protobuf, CBOR, MessagePack, etc.) for a long term archive data format

Question

In discussions for a next generation scientific data format a need for some kind of JSON-like data structures (logical grouping of fieldshas been identified. Additionally, it would be preferable to leverage an existing encoding instead of using a custom binary structure. For serialization formats there are many options. Any guidance or insight from those that have experience with these kinds of encodings is appreciated.

Requirements: In our format, data need to be packed in records, normally no bigger than 4096-bytes. Each record must be independently usable. The data must be readable for decades to come. Data archiving and exchange is done by storing and transmitting a sequence of records. Data corruption must only effect the corrupted records, leaving all others in the file/stream/object readable.

Priorities (roughly in order) are:

stability, long term archive usage
performance, mostly read
ability to store opaque blobs
size
simplicity
broad software (aka library) support
stream-ability, transmitted and readable as a record is generated (if possible)

We have started to look at Protobuf (Protocol Buffers RFC), CBOR (RFC) and a bit at MessagePack.

Any information from those with experience that would help us determine the best fit or, more importantly, avoid pitfalls and dead-ends, would be greatly appreciated.

Thanks in advance!

Frankly any of those would be fine. – Marc Gravell Jul 19 '17 at 18:54 — Marc Gravell, Jul 19 '17 at 18:54

kert · Answer 1 · 2017-12-30T19:55:38.527

5

Late answer but: You may want to decide if you want a schema-based or self-describing format. Amazon Ion overview talks about some of the pros and cons of these design decisions, plus this other ION ( completely different from Amazon Ion ).

Neither of those fully meet your criteria, But these articles should list a few criteria you might want to consider. Obviously actually being a standard and being adopted are far higher guarantees of longevity than any technical design criteria

edited Dec 30 '17 at 19:55

answered Dec 30 '17 at 19:50

kert

2,161
21
22

ION renamed itself to RION. See http://tutorials.jenkov.com/rion/index.html#rion-by-nanosai – David J. Jul 11 '21 at 22:17

score 3 · Answer 2 · edited Jul 11 '21 at 22:13

Your goal of recovery from data corruption almost certainly something that should be addressed in a separate architectural layer from the matter of encoding of the records. How many records to pack in to a blob/file/stream is really more related to how many records you can afford to sequentially read through before finding the one you might need.

An optimal solution to storage corruption depends on what kind of corruption you consider likely. For example, if you store data on spinning disks your best protection might be different from if you store data on tape. But the details of that are really not an application-level concern. It's better to abstract/outsource that sort of concern.

Modern cloud-based data storage services provide extremely robust protection against corruption, measured in the industry as "durability". For example, even the Microsoft Azure's lowest-cost storage option, Locally Redundant Storage (LRS), stores at least three different copies of any data received, and maintains at least that level of protection for as long as you want. If any copy gets damaged, another is made from one of undamaged ones ASAP. That results in an annual "durability" of 11 nines (99.999999999% durability), and that's the "low-cost" option at Microsoft. The normal redundancy plan, Geo Redundant Storage (GRS), offers durability exceeding 16 nines. See Azure Storage redundancy.

According to Wasabi, eleven-nines durability means that if you have 1 million files stored, you might lose one file every 659,000 years. You are about 411 times more likely to get hit by a meteor than losing a file.

P.S. I previously worked on the Microsoft Azure Storage team, so that's the service I know the best. However, I trust that other cloud-storage options (e.g. Wasabi and Amazon's S3) offer similar durability protection, e.g. Amazon S3 Standard and Wasabi hot storage are like Azure LRS: eleven nines durability. If you are not worried about a meteor strike, you can rest assured you these services won't lose your data anytime soon.

Good answer. If you don't mind a little nitpick, you mean disclosure, not disclaimer, and lose, not loose. — Jerry101, Aug 29 '18 at 05:40

Considerations for binary seralizations (Protobuf, CBOR, MessagePack, etc.) for a long term archive data format

2 Answers2