
I am looking for a standard protocol that can serialize a set of objects (all of the same type) to a file, and that also gives the reader/deserializer an easy way to align to an object boundary when it starts reading from a random byte offset.

After some searching, I found that Apache Avro provides this functionality through sync markers, but its C++ library does not expose seek functionality, and there is no native Windows support for the C++ library.

Are there any other well-known protocols that meet the above requirements?

Possible candidates are Protobuf and Thrift, but from what I have found so far they don't appear to provide seeking capabilities (I might be wrong).

user2774767
  • Why would you start reading at a random byte offset? That fundamentally carries risks of false-positives on any sentinel token. What is the scenario here? For example, is the real need the ability to skip to the 427th object without processing the full payload for the first 426? – Marc Gravell Dec 27 '17 at 06:41
  • Let's suppose I have a 100 GB file of records, and I want to process only the last 100 MB of data. I would compute offsetToRead = 100*1024*1024*1024 - 100 * 1024 * 1024, and then sync offsetToRead to the nearest record boundary. – user2774767 Dec 28 '17 at 06:02
  • @MarcGravell This is actually a legit technique. It allows you to dump a huge quantity of variable-width records sequentially and then be able to slice them up into chunks later for e.g. MapReduce without maintaining any sort of index, which is pretty useful. If you use a 128-bit crypto-random sentinel then you won't have false positives. (I remember a story from the early days of Google where I think they used 64-bit sentinels and sure enough started having failures when the search index got big enough...) – Kenton Varda Dec 29 '17 at 04:51
  • @Kenton yeah. That's fair enough. Personally I tend to just either partition the files during storage, or just post-process with a minimal parser (i.e. a protocol reader that knows how to parse lengths and skip entire objects to scroll to the right place). I can see how "start here and find the next sentinel" could make that more efficient, though, for huge files - without the inconvenience of multiple files – Marc Gravell Dec 29 '17 at 09:46
  • @Kenton, is there a well-known storage serialization format that can be used for this? – user2774767 Feb 07 '18 at 02:03
  • @user2774767 Sorry, I'm sure there are several libraries available, but I personally don't usually work with such tools so I don't know what's out there. – Kenton Varda Feb 11 '18 at 02:39
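
For reference, the sync-marker technique discussed in the comments above can be sketched in C++ roughly as follows. This is only a minimal in-memory illustration under stated assumptions, not Avro's actual container format: the record layout (16-byte marker, 4-byte little-endian length, payload), the marker value, and the function names `appendRecord`, `findNextRecord`, `readRecord` and `readFrom` are all hypothetical and chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 16-byte (128-bit) sync marker. A real writer would generate it
// randomly per file and store it in a file header so readers can discover it.
const unsigned char kMarkerBytes[16] = {
    0x9f, 0x3c, 0x71, 0xe4, 0x0b, 0xd8, 0x56, 0xa2,
    0x1d, 0xc7, 0x84, 0x6e, 0xf0, 0x29, 0xb5, 0x43};
const std::string kSyncMarker(reinterpret_cast<const char*>(kMarkerBytes),
                              sizeof kMarkerBytes);

// Appends one record: marker, 4-byte little-endian length, then the payload.
void appendRecord(std::string& out, const std::string& payload) {
    assert(payload.size() <= 0xffffffffu);
    out += kSyncMarker;
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<char>((len >> shift) & 0xff));
    out += payload;
}

// Returns the offset of the first marker at or after `offset`,
// or std::string::npos if there is none.
std::size_t findNextRecord(const std::string& data, std::size_t offset) {
    return data.find(kSyncMarker, offset);
}

// Reads the record whose marker starts at `pos`. On success, fills `payload`,
// sets `next` to the offset just past the record, and returns true.
// Returns false if there is no marker at `pos` or the record is truncated.
bool readRecord(const std::string& data, std::size_t pos,
                std::string& payload, std::size_t& next) {
    const std::size_t header = kSyncMarker.size() + 4;
    if (pos > data.size() || data.size() - pos < header) return false;
    if (data.compare(pos, kSyncMarker.size(), kSyncMarker) != 0) return false;
    std::uint32_t len = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned char byte =
            static_cast<unsigned char>(data[pos + kSyncMarker.size() + i]);
        len |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    if (data.size() - pos - header < len) return false;
    payload.assign(data, pos + header, len);
    next = pos + header + len;
    return true;
}

// Starts at an arbitrary byte offset, aligns to the next record boundary,
// and returns every complete record from there to the end of the data.
std::vector<std::string> readFrom(const std::string& data, std::size_t offset) {
    std::vector<std::string> records;
    std::string payload;
    std::size_t next = 0;
    std::size_t pos = findNextRecord(data, offset);
    while (pos != std::string::npos && readRecord(data, pos, payload, next)) {
        records.push_back(payload);
        pos = findNextRecord(data, next);
    }
    return records;
}
```

With a real file, the same idea would apply after seeking to the offset (for example with `std::ifstream::seekg`) and scanning forward in chunks. The offset itself should be computed with a 64-bit type such as `std::uint64_t`, since `100*1024*1024*1024` overflows a 32-bit `int`. A payload that happens to contain the marker bytes would cause a false boundary, which is why a long random marker is suggested in the comments.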

0 Answers