At the moment we are using Protocol Buffers to exchange data between Python and C++. However, we are running into the maximum file size limitation of Protocol Buffers and are considering switching everything to Cap'n Proto. Since the two formats are somewhat related, I was wondering: does Cap'n Proto also have a maximum file size limitation?
-
What sort of size are we talking about? Note that protobuf *can* go much larger than the officially recommended limits. Cap'n Proto is designed to be much friendlier to large data, with a multi-slab file layout, but it doesn't have as wide support - I guess it depends on whether that matters to you – Marc Gravell Jan 26 '18 at 09:54
-
I am already using `CodedInputStream` to read larger files. But from what I understand, Protocol Buffers have a hard limit at 2GB. If possible, I would like to have files even larger than that. – user823255 Jan 26 '18 at 10:20
-
There is no hard limit imposed by the protocol (well, there *is*, but it involves 64-bit numbers, so it isn't going to be a problem). If a *specific implementation* imposes a hard limit, that could be a problem - granted. There's a [hack suggested here](https://stackoverflow.com/a/13849827/23354) on how to get around it for the C++ version. I run one of the C# versions, and I've helped people work with files much bigger than 2GiB. – Marc Gravell Jan 26 '18 at 10:26
-
Thanks for pointing that out. While we compile the library ourselves and could adapt `kDefaultTotalBytesLimit`, I don't feel 100% comfortable with adjusting the library in our build process. Do you happen to know if Cap'n Proto can handle bigger file sizes out of the box? – user823255 Jan 26 '18 at 12:24
-
I *believe* it can, yes; I wrote some capnp tooling a few years back, but I haven't touched it in a while – Marc Gravell Jan 26 '18 at 13:27
1 Answer
Cap'n Proto has a maximum file size of approximately 2^64 bytes, or 16 exbibytes -- which "should be enough for anyone". :)
Cap'n Proto is in fact an excellent format for extremely large data files, because it supports random access and lazy loading. When reading a huge Cap'n Proto file, I recommend using `mmap()` to map the file into memory, then passing the bytes directly to the Cap'n Proto implementation (e.g. `capnp::FlatArrayMessageReader` in C++). This way, only the pages of the file that you actually use will be brought into memory by the operating system. (In contrast, with Protocol Buffers, it is necessary to parse the entire file upfront into in-memory data structures before you can access any of it.)
Note that an individual `List` value in a Cap'n Proto structure has a limit of 2^29-1 elements. `Text` and `Data` (strings and byte blobs) are special kinds of lists, so this implies that any single contiguous text or byte blob is limited to 512MB. However, you can have multiple such blobs, so larger data can be stored in a single file by splitting it into pieces.
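The splitting approach can be sketched as a schema; the struct and field names here are hypothetical, chosen for illustration:

```capnp
# Hypothetical schema: store a payload larger than 512MB by chunking it.
struct LargeBlob {
  # Each chunk stays under the 2^29-1 byte limit of a single Data value;
  # the outer List can itself hold up to 2^29-1 chunks.
  chunks @0 :List(Data);

  # Combined size of all chunks, kept for convenience.
  totalSize @1 :UInt64;
}
```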
Note also that most Cap'n Proto implementations by default impose a "traversal limit" when reading a Cap'n Proto structure, in order to defend against malicious data containing pointer loops. Typically this defaults to 64MiB. For larger data, you'll want to override the limit -- in C++, pass a custom `ReaderOptions` to the `MessageReader` constructor.

-
Is a list of lists (i.e. `List(List(Data))`) sufficient to get around the 512MB limit or do these need to be explicitly separate fields? – Vitali Dec 02 '20 at 15:45
-
@Vitali `List(Data)` should be sufficient. Each `Data` in the list will be able to be 512MB, and the outer list can have 2^29 - 1 `Data`s in it. – Kenton Varda Dec 04 '20 at 17:40
-
Yeah, I wrote a test to confirm that. I did find that `writeMessageToFd` with a list > 2 GB on macOS throws an exception, because write/writev on that platform appears to be limited to a signed integer, but I haven't gotten a chance to report an issue for it (stupidly didn't save the test case code to make filing the issue easier). – Vitali Dec 05 '20 at 20:06
-
@KentonVarda Do you have a reference to _why_ the 2^29-1 element limit exists? We're trying to use capnp to store many GB of 3D point cloud data, and having to split lists is quite inconvenient. So I wonder how that limit came into being. Especially because capnp does away with protobuf's general 32-bit limit, it's a bit odd to find another similar limit remain in parts of capnp. – nh2 Feb 13 '21 at 16:57
-
@nh2 It's because of the way list pointers are encoded: https://capnproto.org/encoding.html#lists Part "D" of the pointer encoding is 29 bits. As you can see, all the bits of the pointer are accounted for, so to allow larger sizes, we would instead have to store the length separately (maybe as a prefix on the value?), which would either increase overhead for small lists or significantly complicate the spec by having multiple pointer encodings. In retrospect more complexity may have been worth it, but it's hard to change now without breaking compatibility. – Kenton Varda Feb 14 '21 at 18:39