
I'm currently decoding a binary file format so that we can run some internal analyses on the data it contains. It mostly stores arrays of integers or doubles, compressed with some kind of algorithm. As I reverse engineer the compression, I'm starting to wonder whether I'm reinventing the wheel. Is this a well-known compression algorithm, perhaps one with an existing C# library for reading and writing it, or is it something completely homemade?

Here are some examples:

The bytes [1, 0, 31] encode the integer array [31]. The first byte (1) says that the array consists of one number. The second byte (0) says that the numbers are listed one by one. The third byte (31) is the listed number. (As long as it's no larger than 255, it's written as a single byte.)

The bytes [1, 128, 31] encode the same array, but the second byte (128) means that the next byte should be repeated as many times as the first byte says (here, once). This allows arrays to be compressed by storing a run of identical numbers only once. For example, [5, 128, 31] encodes the array [31, 31, 31, 31, 31].

Some more examples:

[5, 0, 1, 2, 3, 4, 5] => [1, 2, 3, 4, 5]
[255, 128, 31] => [31, 31, 31, ..., 31, 31, 31] // (an array of 255 31's)
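A minimal Python sketch of the two modes described so far may make the rules easier to check against real files. It assumes that the top bit of the second byte (value 128) is what selects repeat mode, and it only handles arrays whose length and values each fit in one byte. The names `REPEAT_FLAG` and `decode_small` are made up for illustration and are not part of the format.

```python
REPEAT_FLAG = 0x80  # assumption: bit 7 of the second byte marks a repeated value


def decode_small(data: bytes) -> list[int]:
    """Decode one array whose length and values each fit in a single byte."""
    count, mode = data[0], data[1]
    if mode & REPEAT_FLAG:
        # Repeat mode: a single value byte, repeated `count` times.
        return [data[2]] * count
    # Listed mode: `count` value bytes follow the two header bytes.
    return list(data[2:2 + count])


print(decode_small(bytes([5, 0, 1, 2, 3, 4, 5])))
print(decode_small(bytes([5, 128, 31])))
```

The logic ports directly to C#, since it only indexes into a byte array.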

When the array has more than 255 elements, the number of multiples of 256 in its length is added to the second byte.

[5, 129, 31] => [31, 31, 31, ..., 31, 31, 31] // (second byte 129 = 128 + 1: an array of 5 + 256*1 = 261 31's)

This works for both normal and repeating arrays:

[5, 1, 1, 2, 3, 4, 5, ..., 260, 261] => [1, 2, 3, ..., 260, 261] // (an array with the numbers from 1 to 261)
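The length rule can be sketched as a small header parser. This is only one reading of the examples: it assumes the second byte packs the repeat flag into its top bit and the multiples of 256 into its lower seven bits, which would cap the length at 255 + 256*127. `parse_header` is a hypothetical helper name.

```python
REPEAT_FLAG = 0x80  # assumption: top bit of the second byte selects repeat mode


def parse_header(first: int, second: int) -> tuple[int, bool]:
    """Return (element_count, is_repeated) for a two-byte array header."""
    repeated = bool(second & REPEAT_FLAG)
    # Assumption: the low 7 bits count multiples of 256 in the array length.
    count = first + 256 * (second & 0x7F)
    return count, repeated


print(parse_header(5, 0))  # short listed array
print(parse_header(5, 1))  # listed array with 5 + 256*1 elements
```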

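When any number in the array is larger than 255, the numbers are split into base-256 parts: the low bytes come first as one encoded array, followed by a second encoded array holding the multiples of 256.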

[1, 0, 4, 1, 0, 3] => [4 + 256 * 3] => [772]
[3, 0, 4, 17, 0, 3, 128, 2] => [4 + 256*2, 17 + 256*2, 0 + 256*2] => [516, 529, 512]
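The two examples above can be reproduced with a short Python sketch that decodes two consecutive arrays and combines them as low and high bytes. It relies on assumptions: the header layout is the one guessed from the earlier examples (top bit of the second byte for repeat mode, lower seven bits for multiples of 256 in the length), and the caller already knows that a high-byte array follows. How the format actually signals that is not covered here. `decode_plane` and `decode_two_planes` are illustrative names.

```python
def decode_plane(data: bytes, pos: int) -> tuple[list[int], int]:
    """Decode one byte-valued array at `pos`; return (values, next position)."""
    first, second = data[pos], data[pos + 1]
    count = first + 256 * (second & 0x7F)  # assumed length layout
    pos += 2
    if second & 0x80:  # assumed repeat flag
        return [data[pos]] * count, pos + 1
    return list(data[pos:pos + count]), pos + count


def decode_two_planes(data: bytes) -> list[int]:
    """Combine a low-byte array and the high-byte array that follows it."""
    low, pos = decode_plane(data, 0)
    high, _ = decode_plane(data, pos)
    return [lo + 256 * hi for lo, hi in zip(low, high)]


print(decode_two_planes(bytes([1, 0, 4, 1, 0, 3])))
print(decode_two_planes(bytes([3, 0, 4, 17, 0, 3, 128, 2])))
```

Returning the next position from `decode_plane` is a design choice that lets the same function be called repeatedly on a stream containing several arrays.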

There are some more complications, especially for doubles, but I assume this is enough for anybody who knows the format to recognize it.

Does anybody recognize this binary serialization/compression method? Does it have a name? Are there any C# libraries for working with this technique?

Erlend D.
  • Did you ask on Computer Science SE? – Fildor Nov 25 '19 at 08:03
  • No. Is this more relevant there? If so, should I delete it from here, or leave it both places? – Erlend D. Nov 25 '19 at 08:05
  • It's a form of run-length encoding. Though I don't understand the 128 stuff, because you say it means the following byte is repeated as is. Then what is the difference between [5, 0, 31] and [5, 128, 31] as both would mean [31, 31, 31, 31, 31]. – ckuri Nov 25 '19 at 08:16
  • @ckuri: Thanks! [5, 0, 31] is [31], while [5, 128, 31] is [31, 31, 31, 31, 31], as the second byte being smaller than 128 means that the bytes will be listed directly, as opposed to listed once and repeated 5 times. But these would encode the same array: [5, 0, 31, 31, 31, 31, 31] and [5, 128, 31]. – Erlend D. Nov 25 '19 at 08:19
