
I have never done compression but am interested in the Huffman encoding. They show this as a simple demo encoding for the first few letters:

A     0
E     10
P     110
space 1110
D     11110
T     111110
L     111111

The standard Huffman encoding you see elsewhere has a different set of codes, but that doesn't matter for this question. What I'm wondering is how to manipulate these bits most efficiently in JavaScript. The usual advice is to work in chunks of 8, 16, or 32 bits and nothing else, because that is how integers are stored in the machine. So as I understand it, you should probably read the input in 8-bit chunks. I'm not exactly sure how to do that, but I think something like this would work:

var bytes = new Uint8Array(array)
var byte1 = bytes[0]
var byte2 = bytes[1]
...

This seems like the most efficient way to access the data. But there is an alternative I'm thinking about which I wanted to clarify. You could instead just convert the input to a binary text string, so a string of 1's and 0's, as in

var string = integerOrByteArray.toString(2)

But from what I've learned, converting anything to a string is a performance hit. So it seems you should avoid converting to strings if possible.
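There's also a correctness wrinkle with the string approach that I'd have to handle: `toString(2)` drops leading zeros, so the bit-string no longer lines up with the stored byte unless it is padded back out:

```javascript
// toString(2) drops leading zeros, so the result is shorter than 8 chars
// for most byte values and has to be re-padded by hand:
var bits = (0x16).toString(2);                    // "10110"  (5 chars, not 8)
var padded = (0x16).toString(2).padStart(8, '0'); // "00010110"
```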

So if that's the case, then we are left with the first approach with Uint8Array (or Uint32Array, etc.). I'm wondering how you would then split the value into the component parts efficiently/ideally. So if we had this....

010110
AEP

....and we did our integer thing, then we might load some 8-bit integer like one of these:

01011000
01011001
00101100
...

So we need to join (potentially) any leading bits that belong to the previous 8-bit chunk, and then split the rest into characters. My question is basically what the recommended way of doing this is. I can come up with ways of doing it, but they all seem rather complicated so far.

Lance

2 Answers


This actually interacts with the "rest" of Huffman decompression. What exactly you need here depends on whether you intend to do efficient table-based decoding or bit-by-bit tree-walking. The input cannot be split without decoding, because you only find the length of a code by decoding which symbol it represents. After decoding there is not a lot of point in splitting, so really what we end up with is just a Huffman decoder and not a bit-string splitter.

For bit-by-bit tree-walking, all you need is some way to access any particular bit (given its index) from the byte array. You could also use the technique below with a block size of 1 bit.
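As a sketch of that bit access (my own helper, not from any spec), a single bit at index i, counting msb-first within each byte, can be pulled out like this:

```javascript
// Hypothetical helper: read bit i of a Uint8Array, msb-first within each byte.
function getBit(data, i) {
    return (data[i >> 3] >> (7 - (i & 7))) & 1;
}

// A tree walker would call getBit once per step, going left on 0 and
// right on 1 until it lands on a leaf holding a symbol.
```

For example, for the single byte 0b01011000 the bits come out in order 0, 1, 0, 1, 1, 0, 0, 0.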

For more efficient decoding, you need a buffer from which you can extract a block of bits as long as your pre-defined maximum code length, say 15 bits or so[1]. The specifics depend on the order in which your codes are packed into bytes, that is, whether the bytes are filled lsb-to-msb or msb-to-lsb, and on where in your buffer variable you want to keep the bits. For example, here I keep the bits near the lsb of the buffer, and assume that if a code is split over two bytes then it is in the lsb of the first byte and the msb of the second byte[2] (not tested):

var rawindex = 0;
var buffer = 0;
var nbits = 0;
var done = false;
var blockmask = (1 << MAX_CODELEN) - 1;
while (!done) {
    // refill buffer
    while (nbits < MAX_CODELEN && rawindex < data.length) {
        buffer = (buffer << 8) | data[rawindex++];
        nbits += 8;
    }
    if (nbits < MAX_CODELEN) {
        // this can happen at the end of the data
        buffer <<= MAX_CODELEN - nbits;
        nbits = MAX_CODELEN;
    }
    // get block from buffer
    var block = (buffer >> (nbits - MAX_CODELEN)) & blockmask;
    // decode by table lookup
    var sym = table[block];
    // drop only bits that really belong to the symbol
    nbits -= bitlengthOf(sym);
    ...
    // use the symbol somehow
}

This shows the simplest table-based decoding strategy, just a plain lookup. The symbol/length pair could be an object or stored in two separate Uint8Arrays or encoded into a single Uint16Array, that sort of thing. Building the table is simple, for example in pseudocode:

# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in inclusive_range(0, (1 << bottomSize) - 1):
    table[topBits | bottom] = (symbol, codeLen)
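To tie the two snippets together, here is a runnable sketch using the codes from the question, with MAX_CODELEN = 6 and msb-first packing. The decode function and its nSymbols parameter are names invented for this demo, and stopping after a symbol count stands in for whatever end-of-stream handling a real decoder needs:

```javascript
// Demo only: codes from the question, msb-first packing, MAX_CODELEN = 6.
const MAX_CODELEN = 6;
const CODES = [            // [symbol, code, codeLen]
    ['A', 0b0,      1],
    ['E', 0b10,     2],
    ['P', 0b110,    3],
    [' ', 0b1110,   4],
    ['D', 0b11110,  5],
    ['T', 0b111110, 6],
    ['L', 0b111111, 6],
];

// Build the table: each code, padded with every possible bottom-bit
// pattern, maps to its symbol/length pair.
const table = new Array(1 << MAX_CODELEN);
for (const [symbol, code, codeLen] of CODES) {
    const bottomSize = MAX_CODELEN - codeLen;
    const topBits = code << bottomSize;
    for (let bottom = 0; bottom < (1 << bottomSize); bottom++) {
        table[topBits | bottom] = { symbol, codeLen };
    }
}

// Decode nSymbols symbols from a Uint8Array (msb-first bit order).
function decode(data, nSymbols) {
    const blockmask = (1 << MAX_CODELEN) - 1;
    let rawindex = 0, buffer = 0, nbits = 0;
    const out = [];
    while (out.length < nSymbols) {
        while (nbits < MAX_CODELEN && rawindex < data.length) {
            buffer = (buffer << 8) | data[rawindex++];
            nbits += 8;
        }
        if (nbits < MAX_CODELEN) {        // pad at the end of the data
            buffer <<= MAX_CODELEN - nbits;
            nbits = MAX_CODELEN;
        }
        const block = (buffer >> (nbits - MAX_CODELEN)) & blockmask;
        const { symbol, codeLen } = table[block];
        nbits -= codeLen;                 // drop only the symbol's own bits
        out.push(symbol);
    }
    return out.join('');
}

// "AEP" = 0 10 110, padded to a byte: 01011000 = 0x58
// decode(Uint8Array.of(0x58), 3)  →  "AEP"
```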

Variants

Packing the codes into bytes from lsb up changes the flow of bits. To reassemble a code in the buffer, the bits need to come in from the high side of the buffer and leave from the bottom:

// refill step
buffer |= data[rawindex++] << nbits;
nbits += 8;
...
// get block to decode
var block = buffer & blockmask;
// decode by table lookup
var sym = table[block];
// drop only the bits that belong to the symbol
var len = bitlengthOf(sym);
buffer >>= len;
nbits -= len;

The table is different too: now the padding is in the high bits of the table index, spreading out the entries belonging to a single symbol instead of putting them in a contiguous range (not tested; table entries shown bit-packed with a 5-bit code length field):

// for each symbol/code:
var paddingCount = MAX_CODELEN - codeLen;
for (var padding = 0; padding < (1 << paddingCount); padding++)
    table[(padding << codeLen) | code] = (symbol << 5) + codeLen;
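The same build loop in runnable form (buildLsbTable is a hypothetical name, and symbols are assumed to be small integers so the packed entries fit in a Uint16Array):

```javascript
// Build the lsb-first lookup table; entries pack (symbol << 5) + codeLen.
// codes is an array of [symbol, code, codeLen] with numeric symbols.
function buildLsbTable(codes, maxCodeLen) {
    const table = new Uint16Array(1 << maxCodeLen);
    for (const [symbol, code, codeLen] of codes) {
        const paddingCount = maxCodeLen - codeLen;
        for (let padding = 0; padding < (1 << paddingCount); padding++) {
            // padding sits in the high bits, so one symbol's entries
            // are spread across the table instead of being contiguous
            table[(padding << codeLen) | code] = (symbol << 5) + codeLen;
        }
    }
    return table;
}
```

With a toy prefix-free set read lsb-first (symbol 1 = code 0, 1 bit; symbol 2 = code 01, 2 bits; symbol 3 = code 11, 2 bits) and maxCodeLen = 2, the entries for symbol 1 land at indices 0 and 2, while symbols 2 and 3 each get a single slot.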

[1]: a long maximum code length makes the decoding table very big, and a MAX_CODELEN > 25 risks overflowing the 32-bit buffer. There are ways around that, but super long codes are not very useful anyway.

[2]: this is not what DEFLATE does.

harold
  • "This actually interacts with the "rest" of Huffman decompression." not sure what you mean by that. and by "this is not what DEFLATE does" not sure if you mean this is a good thing or just different. – Lance Feb 13 '19 at 06:23
  • @LancePollard I just mean it's not a separable problem. We cannot split the bit stream into codes and then decode them. These steps are inherently linked, and in a way that changes the problem: there is really no splitting of the codes at all. The chunks read from the buffer are not codes, they usually include extra trailing bits. DEFLATE packs bytes the other way around, I can show you that too – harold Feb 13 '19 at 06:42

You have an excellent answer for reading bits already.

For completeness and in case you want to look into compression as well, here's an (untested) output function that may help with some ideas:

let remainder = 0; 
let remainderBits = 0;  // number of bits held in remainder

function putHuffman( 
    code,       // non-negative huffman code
    nBits) {    // bit length of code

    if( remainderBits) {
        code = code * (2 ** remainderBits) + remainder;
        nBits += remainderBits;
    }
    while( nBits >= 8) {
        putOctet( code % 256);
        code = Math.floor( code / 256);  // integer divide; a bare /= leaves a fraction
        nBits -= 8;
    }
    remainder = code;
    remainderBits = nBits;
}

function putOctet( byte) {
    // add byte to the output stream
}

It could be converted to use bit-shift operators, but as written it allows up to about 46 bits in a code: if 7 left-over bits are added, the bit count reaches 53, the maximum bit precision of the mantissa in a double float.

Of course JavaScript is not well suited to intensive bit operations, given that it lacks an integer data type; using floating-point multiplication does not appear to be significantly slower than left shifting, if it is slower at all.
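To see the byte order this produces, here is the same function (with the integer division spelled out via Math.floor, since a bare /= would leave a fraction) plus a hypothetical flush helper, encoding A, E, P from the question lsb-first:

```javascript
// Collect output bytes in an array for the demo.
const out = [];
function putOctet(byte) { out.push(byte); }

let remainder = 0;
let remainderBits = 0;          // number of bits held in remainder

function putHuffman(code, nBits) {
    if (remainderBits) {
        code = code * (2 ** remainderBits) + remainder;
        nBits += remainderBits;
    }
    while (nBits >= 8) {
        putOctet(code % 256);
        code = Math.floor(code / 256);  // integer divide
        nBits -= 8;
    }
    remainder = code;
    remainderBits = nBits;
}

// Hypothetical helper: pad the last partial byte with zero bits.
function flush() {
    if (remainderBits) putOctet(remainder);
    remainder = remainderBits = 0;
}

putHuffman(0b0, 1);      // A
putHuffman(0b10, 2);     // E
putHuffman(0b110, 3);    // P
flush();
// out is now [0b110100]: earlier codes occupy the lower bits of the byte
```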

traktor