Is there a portable Binary-serialisation schema in FlatBuffers/Protobuf that supports arbitrary 24bit signed integer definitions?

Question

We are sending data over a UART serial link at a high data rate, so data size is important. The optimal format for our data is Int24, which in C/C++ (GCC compiler) can be written as a packed bit-field struct:

#include <array>
#include <cstdint>

#pragma pack(push, 1)
struct Int24
{
    int32_t value : 24; // signed 24-bit bit-field
};
#pragma pack(pop)

typedef std::array<Int24,32> ArrayOfInt24;

This data is packaged with other data and shared among devices and cloud infrastructure, so we need a binary serialisation that can be exchanged between devices of different architectures and programming languages. We would like to use a schema-based binary serialisation such as Protocol Buffers or FlatBuffers, so that client code does not have to handle the bit-shifting and the recovery of the two's-complement sign bit itself. For example, reading the 24-bit value in a non-C language requires the following:

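// _b0 is the least significant byte, _b2 the most significant.
bool isNegative = (_b2 & (byte)0x80) != 0; // bit 23 is the 24-bit sign bit
// Sign-extend negative quantities by filling the top byte with ones
int32_t value = _b0 | (_b1 << 8) | (_b2 << 16) | ((isNegative ? 0xFF : 0x00) << 24);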
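
For comparison, here is a minimal C++ sketch of the same round trip without relying on a compiler-specific bit-field layout. The helper names `packInt24` and `unpackInt24` are hypothetical (not from any library), and the little-endian byte order is an assumption carried over from the snippet above:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::int32_t kInt24Min = -8388608;  // -2^23
constexpr std::int32_t kInt24Max = 8388607;   //  2^23 - 1

// Hypothetical helper: split a signed 24-bit value into three bytes,
// least significant byte first.
inline std::array<std::uint8_t, 3> packInt24(std::int32_t value) {
    assert(value >= kInt24Min && value <= kInt24Max);
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    return {static_cast<std::uint8_t>(u & 0xFFu),
            static_cast<std::uint8_t>((u >> 8) & 0xFFu),
            static_cast<std::uint8_t>((u >> 16) & 0xFFu)};
}

// Hypothetical helper: rebuild the value and sign-extend from bit 23.
inline std::int32_t unpackInt24(const std::array<std::uint8_t, 3>& b) {
    const std::uint32_t u = static_cast<std::uint32_t>(b[0]) |
                            (static_cast<std::uint32_t>(b[1]) << 8) |
                            (static_cast<std::uint32_t>(b[2]) << 16);
    // Flipping bit 23 and subtracting 2^23 maps 0x800000..0xFFFFFF to the
    // negative range without shifting into the sign bit of a signed int.
    return static_cast<std::int32_t>(u ^ 0x800000u) - 0x800000;
}
```

Every client language would need an equivalent of these two functions, which is the boilerplate a schema-based format is meant to remove.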

If no such support exists, which (if any) existing binary serialisation library could most easily be extended to support it? We would be willing to contribute this to an open-source project.

Thank you for all the responses. I am assessing the options. – Crog Jan 20 '20 at 17:04

score 5 · Accepted Answer · answered Jan 16 '20 at 19:57

Depending on various things, you might like to look at ASN.1 and the unaligned Packed Encoding Rules (uPER). This is a binary serialisation that is widely used in telephony to easily minimise the number of transmitted bits. Tools are available for C, C++, C#, Java, Python (I think they cover uPER). A good starting point is Useful Old Technologies.

One of the reasons you might choose to use it is that uPER likely ends up producing fewer bits on the wire than anything else out there. Another benefit is constraints (on values and array sizes). You can express these in your schema, and the generated code will check data against them. This is something that can make a real difference to a project - automatic sanitisation of incoming data is a great way of resisting attacks - and is something that GPB doesn't do.
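
To make the constraint idea concrete, here is a rough C++ sketch of how a range constraint could both validate a value and fix its encoded width. It assumes the field is declared as something like `INTEGER (-8388608..8388607)` and that, as I understand the unaligned PER rules, a constrained integer is sent as its offset from the lower bound in the minimum number of bits for the range. The `Range` type and function names are hypothetical and are not the API of any ASN.1 tool:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a schema-level value constraint (lo..hi).
struct Range {
    std::int64_t lo;
    std::int64_t hi;
};

constexpr Range kInt24Range{-8388608, 8388607};

// The check that generated code would apply to incoming or outgoing data.
constexpr bool satisfies(std::int64_t value, Range r) {
    return value >= r.lo && value <= r.hi;
}

// Minimum number of bits that can distinguish every value in the range.
constexpr unsigned bitsForRange(Range r) {
    std::uint64_t span = static_cast<std::uint64_t>(r.hi - r.lo);
    unsigned bits = 0;
    while (span != 0) {
        ++bits;
        span >>= 1;
    }
    return bits;
}

// The non-negative number that would be written into those bits.
inline std::uint64_t offsetFromLowerBound(std::int64_t value, Range r) {
    assert(satisfies(value, r));
    return static_cast<std::uint64_t>(value - r.lo);
}
```

Under these assumptions `bitsForRange(kInt24Range)` is 24, so each value costs exactly the 24 bits the question asks for, and an out-of-range value is rejected before it is encoded.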

Reasons not to use it are that the very best tools are commercial and quite pricey, though there are some open-source tools that are quite good without necessarily implementing the entire ASN.1 standard (which is vast). It also has a learning curve, though (at a basic level) it is not so very different from Google Protocol Buffers. In fact, at the conference where Google announced GPB, someone asked "why not use ASN.1?". The Google bod hadn't heard of it; somewhat ironically, a search company didn't search the web for binary serialisation technologies, and went right ahead and invented its own...

answered Jan 16 '20 at 19:57

bazza


I am liking the learning from this answer and investigating this ASN.1 as a positive answer. – Crog Jan 20 '20 at 17:10
So basically this looks like a great standard, but it maybe came about a bit early, so it isn't widely used outside the pre-open-source systems. The ESA-supported compiler sounds great though, with C and Java. The main hurdle (as mentioned in the answer) is compiler support. We will have JavaScript in our mix (Node.js) and I have not been able to find uPER support there yet. – Crog Jan 21 '20 at 10:14
Okay, so the lack of JavaScript scares me a little, but the fact that I get so many of the core languages you listed, plus Ada, makes it workable, and it stays within the standard instead of the other options 'not quite' being there. It's what ASN.1 was made for, really, from what I gather! I can hand-roll the JavaScript version of the protocol, which I would have to do with any of the other answers, but ASN.1 takes out a large chunk of maintenance :D – Crog Jan 21 '20 at 11:28
@Crog, Have you considered running a C/C++ ASN.1 toolset inside WASM and interfacing to that? There's an awful lot to the ASN.1 standard, and it's a pretty big undertaking to implement the whole thing oneself. Admittedly, it may well be harder to interface to complex C types in WASM... Also please evaluate tools thoroughly first; the commercial tools are pretty good, but cost a fair amount of money (but the time saved has for me often been of net benefit). There's some pretty good OSS implementations too. – bazza Jan 21 '20 at 22:49
@Crog: Here's some links for starters. The Book http://www.oss.com/asn1/resources/books-whitepapers-pubs/dubuisson-asn1-book.PDF, a commercial tool set https://www.oss.com, another commercial tool set, https://www.obj-sys.com/products/asn1c/pricing.php, a cheap but good commercial tool set https://bellard.org/ffasn1/, a reasonably good OSS project https://github.com/vlm/asn1c. I've used all of those, and they're all reasonable for use. If you need source code, that French toolset comes as source code, the other commercial ones don't (unless you have deep pockets). – bazza Jan 21 '20 at 23:53

score 2 · Answer 2 · answered Jan 16 '20 at 13:38

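Protocol Buffers uses a variable-length integer encoding called varint, so you can just use uint32 or sint32. An encoded integer takes ⌈HB/7⌉ bytes (minimum one), where HB is the position of the highest bit set in the value, counting from 1. Any value that fits in 24 bits therefore takes four bytes or less, and any value < 2^21 takes three bytes or less. Note that a full 32-bit value can take up to five bytes, and that sint32 spends one bit on its ZigZag sign mapping, so its three-byte range is -2^20 to 2^20 - 1.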

Make sure not to use int32 for signed data, as it sign-extends negative values to 64 bits and so always encodes them very inefficiently (10 bytes!). For arrays, just mark the field as repeated; proto3 packs repeated scalar fields by default, so the values are sent back to back behind a single tag and length.

syntax = "proto3";

message Test {
  repeated sint32 data = 1;
}
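
The sizes quoted above can be checked with a small C++ sketch of the ZigZag mapping and the varint length rule. The function names are hypothetical helpers written for illustration, not part of the protobuf library:

```cpp
#include <cstddef>
#include <cstdint>

// ZigZag mapping used by sint32: 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
constexpr std::uint32_t zigZag32(std::int32_t n) {
    return (static_cast<std::uint32_t>(n) << 1) ^ (n < 0 ? 0xFFFFFFFFu : 0u);
}

// Bytes needed by a base-128 varint: 7 payload bits per byte, at least 1 byte.
constexpr std::size_t varintSize(std::uint64_t v) {
    std::size_t bytes = 1;
    while (v >= 0x80) {
        v >>= 7;
        ++bytes;
    }
    return bytes;
}

// Encoded size of one sint32 value (without the field tag).
constexpr std::size_t sint32Size(std::int32_t n) {
    return varintSize(zigZag32(n));
}

// Encoded size of one int32 value: negatives are sign-extended to 64 bits.
constexpr std::size_t int32Size(std::int32_t n) {
    return varintSize(static_cast<std::uint64_t>(static_cast<std::int64_t>(n)));
}

static_assert(sint32Size(-8388608) == 4, "24-bit minimum fits in four bytes");
static_assert(sint32Size(8388607) == 4, "24-bit maximum fits in four bytes");
static_assert(int32Size(-1) == 10, "negative int32 costs ten bytes");
```

So with sint32 the worst case for a 24-bit sample is four bytes, one more than the packed struct, while small magnitudes drop to one or two bytes.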

score 2 · Answer 3 · answered Jan 16 '20 at 17:26

FlatBuffers doesn't support 24-bit ints. The only way to represent it would be something like:

struct Int24 { a:ubyte; b:ubyte; c:ubyte; }

which obviously doesn't do the bit-shifting for you, but would still allow you to pack multiple Int24 together in a parent vector or struct efficiently. It would also save a byte when stored in a table, though there you'd probably be better off with just a 32-bit int, since the overhead is higher.

score 2 · Answer 4 · answered Jan 16 '20 at 17:44

One particularly efficient use of protobuf's varint format is to use it as a sort of compression scheme, by writing the deltas between values.

In your case, if there is any correlation between consecutive values, you could have a repeated sint32 values field. Then as the first entry in the array, write the first value. For all further entries, write the difference from the previous value.

This way, e.g., [100001, 100050, 100023, 95000] would get encoded as [100001, 49, -27, -5023]. As a packed varint array, the deltas would take 3, 1, 1 and 2 bytes, a total of 7 bytes, compared with 12 bytes for a fixed 24-bit encoding and also 12 bytes for non-delta varints.

Of course this also needs a bit of code on the receiving side to process. But adding up the previous value is easy enough to implement in any language.
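
As a rough illustration, the sender and receiver sides could look like the following C++ sketch. The function names are hypothetical, the size helper only counts the ZigZag varint payload of a packed sint32 field (not the tag or length prefix), and it assumes the samples are 24-bit so that differences cannot overflow an int32_t:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sender side: keep the first value, then store each difference
// from the previous value.
inline std::vector<std::int32_t> deltaEncode(const std::vector<std::int32_t>& values) {
    std::vector<std::int32_t> out;
    out.reserve(values.size());
    std::int32_t prev = 0;
    for (std::int32_t v : values) {
        out.push_back(v - prev);
        prev = v;
    }
    return out;
}

// Receiver side: a running sum restores the original values.
inline std::vector<std::int32_t> deltaDecode(const std::vector<std::int32_t>& deltas) {
    std::vector<std::int32_t> out;
    out.reserve(deltas.size());
    std::int32_t acc = 0;
    for (std::int32_t d : deltas) {
        acc += d;
        out.push_back(acc);
    }
    return out;
}

// Payload bytes of the values when written as ZigZag varints.
inline std::size_t packedSint32PayloadSize(const std::vector<std::int32_t>& values) {
    std::size_t total = 0;
    for (std::int32_t n : values) {
        std::uint32_t z = (static_cast<std::uint32_t>(n) << 1) ^ (n < 0 ? 0xFFFFFFFFu : 0u);
        do {
            ++total;
            z >>= 7;
        } while (z != 0);
    }
    return total;
}
```

For the example values above, `packedSint32PayloadSize(deltaEncode(values))` gives 7 and `packedSint32PayloadSize(values)` gives 12, matching the byte counts in the answer.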

I love the consideration of var-int for delta compression. Initially I would like to keep things simple but I will balance this in consideration of the extra boilerplate required. – Crog Jan 20 '20 at 17:10
