
I am working on compressing an arbitrary vector with MATLAB, which provides factory methods for Huffman Coding: huffmandict, huffmanenco, huffmandeco.

The huffmandict function produces a lookup table mapping each symbol in the signal we want to encode to its corresponding codeword. This table is needed both to encode and to decode the signal.

It is trivial to generate the dictionary when you know the input vector. But say I'm compressing to send from Alice to Bob: I can't assume Bob knows the dictionary too, so Alice needs to send the dictionary along with the Huffman code!

Is there a way in MATLAB to generate a bitstream representation of the dictionary that can be prepended to the Huffman code, so that it can be decoded at the other end?

What I have in mind is the following layout, where N is the length of the encoded dictionary in bits:

(N encoded as 8 bits)(huffman dict encoded in N bits)(huffman code)
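As a rough illustration of that layout only, here is a minimal sketch in base MATLAB. `dictBits` and `codeBits` are hypothetical placeholders (some serialized dictionary and the output of huffmanenco), not something MATLAB produces in this form.

```matlab
% Hypothetical placeholders: both are row vectors of 0/1.
dictBits = [1 0 1 1 0 0 1 0 1 1 0 1];   % stands in for a serialized dictionary
codeBits = [0 1 1 0 1 0 0 0 1];         % stands in for the huffmanenco output

% Sender: 8-bit length header, then the dictionary bits, then the Huffman code.
N = numel(dictBits);
assert(N <= 255, 'An 8-bit header can describe at most 255 dictionary bits');
header = dec2bin(N, 8) - '0';           % 8-bit row vector, MSB first
frame  = [header, dictBits, codeBits];

% Receiver: read N from the header, then split the rest of the frame.
Nrx        = bin2dec(char(frame(1:8) + '0'));
dictBitsRx = frame(9 : 8 + Nrx);
codeBitsRx = frame(9 + Nrx : end);
```

An 8-bit header caps the dictionary at 255 bits, so a wider header would probably be needed for anything but a small alphabet.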

It seems odd that MATLAB provides quite powerful factory methods for the encoding but then does not even bother to make them actually usable in digital transmission without a lot of extra work.

I understand that in theory a Huffman tree is often built. Is there a way to generate this in MATLAB, and then convert such a tree back to a dictionary?

Jay
  • Not the most efficient method, but it's easy to save the dictionary as a file and then read it back into memory. Ex: `save('dict.mat','dict','-v6'); f = fopen('dict.mat','rb'); data = uint8(fread(f)); fclose(f);` then transmit `data` to the receiver. On the receiving end, write it back to a file and load it: `f = fopen('dict.mat','wb'); fwrite(f, data); fclose(f); load('dict.mat');` – jodag Jan 07 '18 at 19:48
  • How about `jsonencode` / `jsondecode`? – nekomatic Jan 08 '18 at 13:23

1 Answer


I know of two efficient ways of expressing a code, used in JPEG and gzip, but as I understand it they require the dictionary to be canonical, meaning that every branch on the right side (starting with 1) has to be no shorter than the branch on the left. So you have to convert the code to canonical form first, since the same tree can be labelled in 2^(n-1) equivalent ways (n being the number of codewords) by swapping 0 and 1 at any internal node. The canonical code has the same expected length. You can then describe each symbol by the length of its branch alone, capping the length at a reasonable number like 2^4 = 16 so that each length fits in 4 bits (stored as length minus one). OK, let's get to the code. The vector to be sent is:

L = zeros(1, size(dict,1));      % one codeword length per symbol
for i = 1:size(dict,1)
    L(i) = numel(dict{i,2});
end
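To put that length vector on the wire, one option is a 4-bit field per symbol. The following is a minimal sketch in base MATLAB, assuming every length lies between 1 and 16 and that both sides already agree on how many symbols there are. The values in `L` are made up for illustration.

```matlab
L = [2 3 1 3];                          % hypothetical codeword lengths
assert(all(L >= 1 & L <= 16), 'Lengths must fit in 4 bits as L-1');

% Sender: store L-1 in 4 bits per symbol, MSB first, concatenated in order.
fields   = dec2bin(L - 1, 4) - '0';     % one row of 4 bits per symbol
dictBits = reshape(fields.', 1, []);    % flatten row by row into one bit vector

% Receiver: cut the bit vector into 4-bit fields and undo the offset.
fieldsRx = reshape(dictBits, 4, []).';  % back to one row per symbol
Lrx      = bin2dec(char(fieldsRx + '0')).' + 1;
```

This costs 4 bits per symbol regardless of the alphabet; formats such as gzip compress the length vector further, which is not attempted here.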

On the receiving side you have to do a little more (I assume the symbols come in some fixed order known to both sides):

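d = cell(numel(L), 2);           % rebuilt dictionary: {symbol, codeword}
k = 0;                           % next free code value at the current length
for l = 1:16
    k = k * 2;                   % moving to length l appends one bit
    for j = find(L==l)           % L must be a row vector here
        d{j,1} = j;
        d{j,2} = de2bi(k, l, 'left-msb');   % k as l bits, MSB first
        k = k + 1;
    end
end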

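To convert to canonical form, you only need to convert your tree to this length-vector format and back. Alice then has to encode with the rebuilt dictionary as well, so that both sides use the same codewords.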
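A minimal round-trip sketch of that conversion, assuming the Communications Toolbox functions from the question are available. The signal and probabilities are made-up illustration values, and `dec2bin(k, l) - '0'` is swapped in for `de2bi` simply because it is base MATLAB.

```matlab
% Requires the Communications Toolbox (huffmandict, huffmanenco, huffmandeco).
sig     = [1 1 1 1 2 2 3 4];            % hypothetical signal
symbols = 1:4;
p       = [4 2 1 1] / 8;                % relative frequencies of 1..4 in sig

dict = huffmandict(symbols, p);         % MATLAB's (possibly non-canonical) code
L    = cellfun(@numel, dict(:,2)).';    % keep only the codeword lengths (row vector)

% Rebuild a canonical dictionary from the lengths alone.
canon = cell(numel(L), 2);
k = 0;
for l = 1:max(L)
    k = k * 2;                          % one more bit per codeword
    for j = find(L == l)
        canon{j,1} = dict{j,1};         % a receiver would use the agreed symbol order
        canon{j,2} = dec2bin(k, l) - '0';   % k as l bits, MSB first
        k = k + 1;
    end
end

% Both sides now encode and decode with the canonical dictionary.
enc = huffmanenco(sig, canon);
dec = huffmandeco(enc, canon);
ok  = isequal(dec(:), sig(:));          % round trip should reproduce the signal
```

Because only the lengths are reused, the encoded size is the same as with the original dictionary; just the bit patterns of the codewords change.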

mnz