0

I have a data stream of bytes, also known as radix 256 symbols. What is the best algorithm to convert that, ideally on the fly, to a new stream of symbols where the radix of each symbol varies and is only known at runtime? The lengths of the input byte stream and the target radix list are both long but finite. All non-negative integers, no floating point. Additionally, the target radix cannot be guaranteed to evenly divide or be a multiple of 256.

Reinderien
  • 11,755
  • 5
  • 49
  • 77
  • does the output stream need to have any special properties (like be a number in some way), or do you just need to be able to get the original stream back from the output stream and the radix list? – Matt Timmermans Nov 03 '15 at 13:35
  • @MattTimmermans Basically, separate non-negative integers of the specified radix. And yes, the original stream must later be restored. – Reinderien Nov 03 '15 at 13:56

1 Answers1

1

Your problem is a subset of Arithmetic Encoding, which is used as the last stage in many compression algorithms. It's one of the coolest things to learn in CS:

http://www.drdobbs.com/cpp/data-compression-with-arithmetic-encodin/240169251 https://en.wikipedia.org/wiki/Arithmetic_coding

How your problem specifically relates:

The encoder you want is an arithmetic decoder, and for each decoding you will use a different sized alphabet (the radix) with equal probabilities for all symbols.

The main loop of your encoder will do something like this:

int val=0; //information from the stream
int range=1; //val is in [0,range)
while(...)
{
    int radix = next_radix();
    //ensure adequate efficiency
    while(range < radix*256)
    {
        val = (val<<8)|(next_byte()&255);
        range<<=8;
    }
    int output = (int)(radix*(long)val/range);
    //find the smallest possible val that produces this output
    int low = (int)((output*(long)range+radix-1)/radix);
    //find the smallest possible val that produces the next output
    int high = (int)(((output+1)*(long)range+radix-1)/radix);
    val-=low;
    range = high-low;
    write(output);
} 

There are complications with handling the termination conditions and handling carries in your decoder (the arithmetic encoder), so you'll have to read the literature, starting with the stuff I linked. I hope this gives you the flavour of how it works, though.

Good luck

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • Does this: `(out*self._range + radix - 1)/radix` not evaluate to this? `(out*self.range - 1)/radix + 1`? Similarly, `((out + 1)*self._range + radix - 1)/radix` = `((out + 1)*self._range - 1)/radix + 1` – Reinderien Nov 03 '15 at 17:16
  • the first one is different when out==0, because integer division truncates instead of rounding down. In case you don't recognize this trick, for POSITIVE integer division, (A+B-1)/B is A/B rounded up, (A+(B>>1))/B is A/B rounded to nearest, and A/B is A/B rounded down. – Matt Timmermans Nov 03 '15 at 22:18