
I am investigating migrating a highly customized and efficient binary format to one of the available binary serialization formats. The data is stored on low-powered mobile devices, among other places, so performance is an important requirement. An advantage of the current format is that all strings are stored in a pool. This means that we don't repeat the same string hundreds of times in a file: we read it only once during deserialization, and all objects reference it by its index. It also means that we keep only one copy in memory. So a lot of advantages :) I was not able to find a way to do this with Cap'n Proto or FlatBuffers. Or would I need to build a layer on top and explicitly use integer indexes to strings in the generated objects?

Thank you!

MichalMa

2 Answers


FlatBuffers supports string pooling. Simply serialize a string once, then refer to that string multiple times in other objects. The string will only occur in memory once.

Simplest example, schema:

table MyObject { name: string; id: string; }

code (C++):

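// Assumes the header generated by flatc from the schema above is included.
flatbuffers::FlatBufferBuilder fbb;
// Serialize the string once and keep its offset.
auto s = fbb.CreateString("MyPooledString");
// Both string fields point to the same data:
auto o = CreateMyObject(fbb, s, s);
fbb.Finish(o);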
Aardappel
  • Did I miss information about this in their documentation? – MichalMa Jan 24 '16 at 19:29
  • It is implicit in the "internals" part, but could probably be more clear, yes. – Aardappel Jan 24 '16 at 23:02
  • @Aardappel I guess Flatbuffers does not defend against [amplification attacks](https://capnproto.org/encoding.html#amplification-attack)? – Kenton Varda Jan 25 '16 at 06:11
  • @kenton-varda : it does actually. It has a verifier where you can specify the max object visited that is still legal. It allows no cycles by design (unsigned offsets). – Aardappel Jan 25 '16 at 16:56
  • Hmm, so basically you need to estimate how many times an object might be reused, rather than relying on the total size of the message to estimate how much data it contains. I guess that makes some sense. – Kenton Varda Jan 25 '16 at 20:28

With Cap'n Proto, you can always do this manually, for example:

struct MyMessage {
  stringTable @0 :List(Text);

  # Now encode string fields as integer indexes into the string table.
  someString @1 :UInt32;
  otherString @2 :UInt32;
}
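
On the application side, the indexes for such a table could come from a small interning helper. The following is only a minimal sketch using the C++ standard library; `StringPool`, `intern`, and `lookup` are hypothetical names for illustration and are not part of Cap'n Proto or its generated code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not a Cap'n Proto API): interns strings and hands out
// indexes into a table that could later be copied into stringTable.
class StringPool {
public:
    // Returns the index of s, appending it to the table on first use.
    std::uint32_t intern(const std::string& s) {
        auto it = index_.find(s);
        if (it != index_.end()) return it->second;
        auto id = static_cast<std::uint32_t>(table_.size());
        table_.push_back(s);
        index_.emplace(s, id);
        return id;
    }

    // Reader side: resolves an index to the single stored copy.
    // Throws std::out_of_range for an index that is not in the table.
    const std::string& lookup(std::uint32_t id) const { return table_.at(id); }

    // The strings in index order, ready to be written out as a list.
    const std::vector<std::string>& table() const { return table_; }

private:
    std::vector<std::string> table_;
    std::unordered_map<std::string, std::uint32_t> index_;
};
```

A writer would call `intern()` for each string field and store the returned value in `someString` or `otherString`; a reader would resolve those values through the table, and should treat out-of-range indexes as invalid input.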

Cap'n Proto could in theory allow multiple pointers to point at the same object, but currently prohibits this for security reasons: it would be too easy to DoS servers that don't expect it by sending messages that are cyclic or contain lots of overlapping references. See the section on amplification attacks in the docs.
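
To illustrate why overlapping references are a concern, here is a toy model in plain C++. It is not Cap'n Proto or FlatBuffers code, and `Node`, `makeSharedChain`, and `traverse` are made-up names. Each node points twice at the next one, so a naive traversal of a chain of n nodes performs 2^n - 1 visits even though only n nodes are stored. A visit budget, similar in spirit to a traversal limit, caps the work.

```cpp
#include <cstdint>
#include <vector>

// Toy node: up to two child references by index, -1 meaning "none".
struct Node {
    int left = -1;
    int right = -1;
};

// Builds a chain of `depth` nodes in which node i points twice at node i + 1.
std::vector<Node> makeSharedChain(int depth) {
    std::vector<Node> nodes(depth);
    for (int i = 0; i + 1 < depth; ++i) {
        nodes[i].left = i + 1;
        nodes[i].right = i + 1;
    }
    return nodes;
}

// Naive recursive traversal that counts every visit and gives up once
// `budget` is exhausted. Returns false if the budget ran out.
bool traverse(const std::vector<Node>& nodes, int index,
              std::uint64_t& budget, std::uint64_t& visited) {
    if (index < 0) return true;     // no child here
    if (budget == 0) return false;  // limit reached, reject the message
    --budget;
    ++visited;
    return traverse(nodes, nodes[index].left, budget, visited) &&
           traverse(nodes, nodes[index].right, budget, visited);
}
```

With a chain of 10 nodes the unbounded traversal makes 1023 visits; a budget of 100 stops it after exactly 100.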

Kenton Varda