
I am trying to encode signed values in the range -256 to 255 (i.e. 9-bit data stored in a `short`) with an arithmetic encoder. However, I have found that existing implementations of arithmetic coding (such as dlib, rANS) usually read the input as a string of bytes and treat the data as 8-bit symbols.

The problem with this approach is that splitting the signed data (shown in the signed histogram below) into bytes destroys the underlying histogram (shown in the split histogram below). I believe that such splitting may also degrade the compression ratio (but I may be wrong).
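To make the splitting concrete, here is a minimal sketch (plain C++, with made-up sample values and no compression library) that counts the same data once as signed 9-bit symbols and once as the raw bytes a byte-oriented coder would see:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main()
{
    // Made-up 9-bit samples in [-256, 255], stored in 16-bit shorts.
    std::vector<int16_t> data = { -256, -3, -1, 0, 0, 1, 1, 2, 100, 255 };

    // Histogram over the signed 9-bit symbols (at most 512 bins).
    std::map<int, int> signed_hist;
    for (int16_t v : data)
        ++signed_hist[v];

    // Histogram over the bytes a byte-oriented coder sees when the
    // shorts are serialized little-endian (two symbols per sample).
    std::map<int, int> byte_hist;
    for (int16_t v : data)
    {
        uint16_t u = static_cast<uint16_t>(v);
        ++byte_hist[u & 0xFF];         // low byte
        ++byte_hist[(u >> 8) & 0xFF];  // high byte
    }

    std::printf("signed symbols: %zu distinct over %zu values\n",
                signed_hist.size(), data.size());
    std::printf("byte symbols:   %zu distinct over %zu values\n",
                byte_hist.size(), data.size() * 2);
    return 0;
}
```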

I tested this hypothesis by implementing Huffman coding over both 8-bit and 16-bit symbols and found that I was right; this may be due to Huffman coding building its tree from the symbol probabilities.

(EDITED) My question is: how do I encode/model symbols that do not fit in a conventional 8-bit container so that the resulting symbols can be compressed easily with traditional arithmetic-coder implementations without hurting the compression ratio?

Signed histogram:

[signed histogram image]

Split histogram:

[split histogram image]

Asif Ali
  • Hi Asif, are you asking for someone to give you an algorithm here? Or something else? It's a little unclear. – TylerH Feb 14 '18 at 14:49
  • Are you asking for code, do you want an algorithm, or are you asking for a library? – NathanOliver Feb 14 '18 at 14:49
  • Why not expand your 9-bit data to 16-bit? You'll get a lot of `0x00` and `0x01`'s, which wouldn't impact the compression (very much... I think... maybe)... please tell me if this makes no sense XD – Stefan Feb 14 '18 at 14:52
  • @Stefan, I have already mentioned that I tested this hypothesis with Huffman encoding and the results were drastic. For example, splitting the short into two bytes increases the size of the overall data to be compressed, so the compression ratio is significantly affected. – Asif Ali Feb 14 '18 at 14:56
  • Sorry to say but....: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. – Stefan Feb 14 '18 at 14:56
  • Ah, I totally overlooked that part.. my bad sry – Stefan Feb 14 '18 at 14:59
  • @Stefan, sorry if this seems like asking for a recommendation. In my defense, I believe the question is more about finding a solution to the problem than about a recommendation (a recommendation implies that some possible solutions already exist). – Asif Ali Feb 14 '18 at 14:59
  • I agree it is a valid question (maybe you could fine-tune the form a bit). Let's see what happens. – Stefan Feb 14 '18 at 15:07
  • @AsifAli Maybe zig-zag encoding to get unsigned values with a nice distribution (0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4, ...) and then an order-1 model with 3 probability tables (one for the MSB, one for the LSB following a 0, one for the LSB following a 1) [a sketch of this mapping follows the comments]. | IIRC FastAC can handle more than 256 symbols. – Dan Mašek Feb 14 '18 at 21:11
  • @AsifAli A [quick draft](https://pastebin.com/Unap8Vbz) of using FastAC, either directly coding zig-zag-mapped 9-bit values or using the split approach mentioned above. Both approaches seem to result in a similar compression ratio. Given that, it seems you can do well enough with alphabets of <= 256 symbols. Maybe rephrase the question so it's not asking for a library and I can write up an answer. – Dan Mašek Feb 14 '18 at 22:35
  • @AsifAli [Update](https://pastebin.com/D0PLGmGu) with an approach coding a 2-symbol sign and 2x 256-symbol absolute value. – Dan Mašek Feb 14 '18 at 23:11
  • Nothing about the dlib entropy encoder restricts the set of symbols to 8-bit numbers. See the interface: http://dlib.net/dlib/entropy_encoder/entropy_encoder_kernel_abstract.h.html. There is no commitment to any particular symbol set size. – Davis King Feb 15 '18 at 02:40
  • Thanks @DanMašek for the extensive answer; I will test the provided code. Furthermore, I have edited the question to ask about a specific problem and hope that the "on-hold" flag will be removed soon. Looking forward to a fuller explanation of the quick draft in an answer. – Asif Ali Feb 15 '18 at 10:25
  • @DavisKing, I have already tried to use a 16-bit compressor from dlib. In theory, the library is capable of handling both 9-bit and 16-bit compression modes, but the implementation of both cases is left untested/unimplemented (see https://github.com/davisking/dlib/blob/master/dlib/compress_stream.h line 65 onwards). I already tried to define my own kernel_type (to use either 9-bit or 16-bit symbols) but failed miserably (I believe my programming skills are to blame). – Asif Ali Feb 15 '18 at 10:43
  • 2
    It's not untested and unimplemented. Even that file you linked to uses multiple alphabet sizes, including 513, 65534, and 65537. All these modes are regularly unit tested. Read the documentation. – Davis King Feb 15 '18 at 11:40
  • Dear @DavisKing, my apologies if it seemed like I didn't read the documentation, but I did. If you look at http://dlib.net/compression.html, various kernels have been benchmarked, but none of them use _fcX_length_, _fcX_length_2 or _fcX_index_ (for 9, 16 and 15 bits respectively). Could you kindly point me to the file in which tests and/or implementations for such bit widths are presented? Thanks in advance. – Asif Ali Feb 15 '18 at 14:45
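For reference, here is a minimal sketch of the zig-zag mapping suggested in the comments above. The helper names `zigzag_encode`/`zigzag_decode` are made up, and no FastAC or dlib calls are shown; any coder that accepts a 512-symbol alphabet (or the MSB/LSB split) could consume the output.

```cpp
#include <cassert>
#include <cstdint>

// Zig-zag map: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// For inputs in [-256, 255] the result fits in 9 bits (0..511).
static uint16_t zigzag_encode(int16_t v)
{
    uint16_t u = static_cast<uint16_t>(v);
    // Equivalent to (v << 1) ^ (v >> 15), written with unsigned
    // arithmetic to avoid shifting a negative value.
    return static_cast<uint16_t>((u << 1) ^ (v < 0 ? 0xFFFFu : 0u));
}

static int16_t zigzag_decode(uint16_t u)
{
    int16_t half = static_cast<int16_t>(u >> 1);
    return (u & 1) ? static_cast<int16_t>(-half - 1) : half;
}

int main()
{
    for (int v = -256; v <= 255; ++v)
    {
        uint16_t u = zigzag_encode(static_cast<int16_t>(v));
        assert(u < 512);                   // fits a 512-symbol alphabet
        assert(zigzag_decode(u) == v);     // lossless round trip

        // Alternatively, split into two small symbols for an order-1 model:
        // a 1-bit MSB (0 or 1) and an 8-bit LSB (0..255).
        uint8_t msb = static_cast<uint8_t>(u >> 8);
        uint8_t lsb = static_cast<uint8_t>(u & 0xFF);
        assert(((static_cast<uint16_t>(msb) << 8) | lsb) == u);
    }
    return 0;
}
```

As the comments above note, FastAC can reportedly handle more than 256 symbols and dlib's entropy encoder is not restricted to 8-bit alphabets, so the 512-value zig-zag output could also be coded directly as a single symbol per sample.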

0 Answers